UNSUPERVISED HUMAN EMOTION RECOGNITION VIA BODY POSE DYNAMICS BASED ON SELF-SUPERVISED CONTRASTIVE LEARNING
DOI: https://doi.org/10.28925/2663-4023.2026.32.1006

Keywords: unsupervised learning, emotion recognition, contrastive learning, human pose estimation, CNN, LSTM, NT-Xent loss, real-time

Abstract
The article presents the results of a study addressing the pressing problem of automated emotion recognition when large amounts of labeled data are unavailable. The core idea is an unsupervised approach to training neural networks that detects emotional patterns directly from the geometry and kinetics of the human body. The introduction substantiates the shift from classical supervised learning to self-supervised approaches, motivated by the high cost and subjectivity of manual emotional-state labeling. The object, subject, and goal of the study are defined, focused on building a fast and accurate system for recognizing seven basic emotions in a video stream.
The literature review shows that existing solutions (OpenPose, MoveNet) handle pose estimation well, but their use for affective-state analysis is usually constrained by the need for massive labeled datasets. A scientific gap is identified: insufficient attention to unsupervised learning of the emotional component of movement. The methodology section details the proposed hybrid architecture, which combines a Convolutional Neural Network (CNN) for spatial pose encoding with recurrent LSTM blocks for analyzing temporal dynamics. A key element of the methodology is the SimCLR framework, in which training proceeds by minimizing the NT-Xent contrastive loss. The mathematical influence of the temperature parameter $\tau$ on the model's ability to separate visually similar poses (hard negatives) is substantiated, ensuring high-quality feature formation in the latent space.
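The role of the temperature $\tau$ in the NT-Xent loss can be illustrated with a minimal NumPy sketch. This is not the article's implementation; the function name and batch layout (rows $i$ and $i+N$ holding the two augmented views of the same pose sequence) are illustrative assumptions following the standard SimCLR formulation.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.
    z: array of shape (2N, d); rows i and i+N are two views of the
    same sample.  Illustrative sketch, not the article's code."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # project onto unit sphere
    n = z.shape[0] // 2
    sim = z @ z.T / tau                               # cosine similarities scaled by 1/tau
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # index of each row's positive partner: i <-> i+n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # per-row cross-entropy: -log softmax at the positive index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

A smaller $\tau$ magnifies similarity differences before the softmax, so hard negatives (poses nearly identical to the positive) dominate the gradient; a larger $\tau$ flattens the distribution and treats all negatives more uniformly.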
The experimental section describes testing on the international RAVDESS and CK+ datasets. It outlines the stages of video preprocessing with MediaPipe Holistic, coordinate normalization, and the construction of positive data pairs for contrastive learning. The results confirm the method's efficiency: fine-tuning on only 10% of the labeled data achieved 78.5% accuracy, comparable to fully supervised training. Particular attention is paid to throughput of 42–45 FPS, confirming that the system can run in real time. The conclusions summarize the scientific novelty of the work, the adaptation of contrastive learning methods to emotional body kinetics, and outline practical prospects for deployment in social robotics, security systems, and human-machine interfaces.
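The preprocessing steps above (normalization and positive-pair construction) can be sketched as follows. The landmark indices follow MediaPipe Pose's 33-point layout; the augmentation parameters (jitter magnitude, crop ratio) are illustrative assumptions, not values reported in the article.

```python
import numpy as np

# MediaPipe Pose landmark indices (33-point topology)
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 11, 12, 23, 24

def normalize_pose(seq):
    """Translate each frame so the hip midpoint is the origin and scale
    by mean torso length, making features invariant to the subject's
    position and distance from the camera.  seq: (T, 33, 2)."""
    hips = (seq[:, L_HIP] + seq[:, R_HIP]) / 2.0              # (T, 2)
    shoulders = (seq[:, L_SHOULDER] + seq[:, R_SHOULDER]) / 2.0
    torso = np.linalg.norm(shoulders - hips, axis=1).mean()   # scalar scale
    return (seq - hips[:, None, :]) / max(torso, 1e-6)

def positive_pair(seq, rng, jitter=0.02, crop=0.9):
    """Two stochastic views of the same pose sequence (random temporal
    crop plus small Gaussian keypoint jitter) for contrastive training."""
    def view():
        t = seq.shape[0]
        w = max(1, int(t * crop))
        start = rng.integers(0, t - w + 1)
        v = seq[start:start + w]
        return normalize_pose(v + rng.normal(0.0, jitter, v.shape))
    return view(), view()
```

Because both views come from the same underlying sequence, their embeddings form a positive pair for the NT-Xent loss, while views of other sequences in the batch act as negatives.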
References
Anandan, P., & Karthik, S. (2022). Comparative analysis of lightweight OpenPose and MoveNet AI models for real-time fall detection and alert systems. Sensors and Materials, 34(11), 4057–4072. https://doi.org/10.18494/SAM3994
Bhattacharya, U., Ronchi, C., Machlev, K., Xu, R., Han, S., & Manocha, D. (2021). Pose-SCLR: Self-supervised contrastive learning of skeleton representations for emotion recognition. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR) (pp. 5608–5615). IEEE. https://doi.org/10.1109/ICPR48806.2021.9412128
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML) (Vol. 119, pp. 1597–1607).
Choutas, V., Pavlakos, G., Ng, M. J., Gulati, A., & Tzionas, D. (2022). ElePose: Unsupervised 3D human pose estimation by predicting camera elevation and learning normalizing flows on 2D poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1312–1322).
Ding, Z., Han, K., & Zhou, W. (2022). Improving unsupervised label propagation for pose tracking and video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 214–231).
Jakab, T., Gupta, A., Bilen, H., & Radig, B. (2020). Unsupervised human pose estimation through transforming shape templates. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2476–2486).
Khan, M. A., et al. (2021). Unsupervised machine learning to detect abnormal activities using CNN and 3D spatial-temporal autoencoder (3DSTAE). IEEE Access, 9, 87431–87445.
Kundu, A. S., et al. (2022). Self-supervised 3D human pose estimation from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5234–5248.
Lee, J., et al. (2023). Spatio-temporal graph convolutional networks vs CNN-LSTM for emotion recognition: A comparative study. Journal of Artificial Intelligence Research, 76, 441–465.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 13(5), Article e0196391. https://doi.org/10.1371/journal.pone.0196391
Rao, H., et al. (2021). Contrastive learning for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1913–1922).
Wang, N., Zhou, W., & Li, H. (2021). Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision, 129, 547–565.
Yadav, S. K., Singh, K., & Sharma, N. K. (2024). Real-time human pose estimation and tracking on monocular videos: A systematic literature review. Multimedia Tools and Applications, 83, 1245–1289.
Zhao, M., Adib, F., & Katabi, D. (2021). Emotion recognition using wireless signals and CNN-LSTM networks. IEEE Transactions on Affective Computing, 12(1), 75–88. https://doi.org/10.1109/TAFFC.2018.2855212
Zinchenko, O. V., Zvenihorodskyi, O. S., & Kysil, T. M. (2022). Convolutional neural networks for solving computer vision problems. Telecommunication and Information Technologies, (2), 4–12. https://tit.dut.edu.ua/index.php/telecommunication/article/view/2417
Zinchenko, O. V., & Kysil, T. M. (2025). Convolutional neural networks for moving object analysis in video streams. Zviazok, (4), 48–57. https://doi.org/10.31673/2412-9070.2025.042042
Kysil, T. M. (2025, December 11). CNN-LSTM approach to real-time emotion recognition based on pose. In Proceedings of the International Scientific and Practical Conference “Modern Achievements of Hewlett Packard Enterprise in IT and New Opportunities for Their Study and Application” (pp. 110–112).
Kovalchuk, O. V. (Ed.). (2022). Methods and technologies of semi-supervised learning: Lecture course. Naukova Dumka.
License
Copyright (c) 2026 Tetiana Kysil, Olha Zinchenko

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.