A Multimodal Transformer-Based Framework for Emotion Analysis in Multilingual Video Content †


Yakut S., Tuten Y. T., Caglar E., AKTAŞ M. S.

Computers, vol. 15, no. 2, 2026 (ESCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 15 Issue: 2
  • Publication Date: 2026
  • DOI: 10.3390/computers15020077
  • Journal Name: Computers
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, Aerospace Database, Compendex, INSPEC, Directory of Open Access Journals
  • Keywords: anxiety, boredom, CNN, cognitive load, computer vision, deep learning, emotion recognition, facial expression recognition (FER), fatigue, stress, transformer
  • Affiliated with Yıldız Technical University: Yes

Abstract

This research addresses the challenge of inferring complex psychological states, including stress, fatigue, anxiety, cognitive load, and boredom, from facial expressions. We propose an interpretable, literature-informed emotion-weighting methodology that transforms the eight-emotion probability outputs of facial emotion recognition models into continuous estimates of these five psychological states using weights derived from the Valence–Arousal framework, providing a principled bridge between discrete emotion predictions and higher-level affective constructs. The proposed formulation is evaluated across six representative deep learning architectures—a baseline CNN (ResNet-50), a modern CNN (ConvNeXt), a hybrid attention-based model (DDAMFN), and three Transformer-based models (ViT, BEiT, and Swin). Our results demonstrate that strong performance on discrete FER tasks does not directly translate to consistent behavior in complex state inference; instead, architectures capable of preserving subtle and distributed affective cues yield more stable and interpretable state estimates, with DDAMFN and Vision Transformer models exhibiting the most consistent performance across the evaluated psychological states. These findings highlight the central role of the proposed emotion-weighting formulation and the importance of architecture selection beyond categorical accuracy in complex affective state analysis.
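
The emotion-weighting step described in the abstract can be illustrated with a minimal Python sketch. It assumes the mapping is a plain weighted sum over the emotion probabilities, which the abstract does not confirm. The eight emotion labels (an AffectNet-style set), the weight values, and the function name `estimate_states` are all hypothetical placeholders chosen for illustration; they are not the weights or the formulation published in the paper.

```python
# Assumed label set: the abstract does not list the eight emotions.
EMOTIONS = ["neutral", "happy", "sad", "surprise",
            "fear", "disgust", "anger", "contempt"]
STATES = ["stress", "fatigue", "anxiety", "cognitive_load", "boredom"]

# Placeholder weights, NOT the paper's values. Each row sums to 1 so that
# a state score stays in [0, 1] when the input is a probability vector.
WEIGHTS = {
    "stress":         {"fear": 0.4, "anger": 0.4, "disgust": 0.2},
    "fatigue":        {"sad": 0.5, "neutral": 0.5},
    "anxiety":        {"fear": 0.6, "surprise": 0.2, "sad": 0.2},
    "cognitive_load": {"surprise": 0.3, "neutral": 0.4, "anger": 0.3},
    "boredom":        {"neutral": 0.6, "contempt": 0.2, "sad": 0.2},
}


def estimate_states(probs):
    """Map an eight-emotion probability vector to five state scores.

    Assumes a linear weighted sum; the paper's actual formulation may differ.
    """
    if len(probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalise so the scores are comparable across models.
    p = {emotion: value / total for emotion, value in zip(EMOTIONS, probs)}
    return {
        state: sum(weight * p[emotion]
                   for emotion, weight in WEIGHTS[state].items())
        for state in STATES
    }
```

In this sketch, the softmax output of any of the six FER models for one face crop would be passed as `probs`, and per-frame scores could then be compared across architectures. Keeping the weights in a separate table is what makes the mapping inspectable: each state score traces back to named emotions.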