Enhanced speech emotion recognition using averaged valence arousal dominance mapping and deep neural networks

Rizhinashvili, Davit; Sham, Abdallah; Anbarjafari, Gholamreza

doi:10.1007/s11760-024-03406-8

Rizhinashvili D., Sham A. H., Anbarjafari G.

Signal, Image and Video Processing, vol.18, no.10, pp.7445-7454, 2024 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 18 Issue: 10
Publication Date: 2024
DOI Number: 10.1007/s11760-024-03406-8
Journal Name: Signal, Image and Video Processing
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, zbMATH
Page Numbers: pp.7445-7454
Keywords: Deep neural networks, LSTM, Speech analysis, Speech emotion recognition, Valence arousal dominance
Yıldız Technical University Affiliated: No

Abstract

This study advances speech emotion recognition (SER) by establishing a novel approach for emotion mapping and prediction based on the Valence-Arousal-Dominance (VAD) model. Central to the research is the creation of reliable emotion-to-VAD mappings, obtained by averaging the outputs of multiple pre-trained networks applied to the RAVDESS dataset. This approach resolves prior inconsistencies in emotion-to-VAD mappings and provides a dependable framework for SER. The study also introduces a refined SER model that combines the pre-trained wav2vec 2.0 with Long Short-Term Memory (LSTM) networks and linear layers, ending in an output layer that represents valence, arousal, and dominance. The model achieves commendable accuracy across several datasets, including RAVDESS, EMO-DB, CREMA-D, and TESS, which demonstrates its robustness and adaptability and marks an improvement over earlier models susceptible to dataset-specific overfitting. The research further presents a comprehensive speech analysis application that denoises speech, segments it, and profiles the emotions in each segment. The application offers interactive emotion tracking and sentiment reports, illustrating its practicality across diverse use cases. The study recognizes ongoing challenges in SER, especially in handling the subjective nature of emotion perception and in integrating multimodal data. Although the research marks a step forward in SER technology, it underscores the need for continued research and for careful consideration of ethical aspects when deploying such technologies. This work contributes to the SER domain a dependable method for emotion mapping, a robust model for emotion recognition, and a user-friendly application for practical use.
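The averaging idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to average within each model before averaging across models, the nearest-point lookup, and all numbers are assumptions made purely for illustration.

```python
from statistics import mean


def average_vad_mapping(predictions):
    """Build an emotion -> (valence, arousal, dominance) mapping.

    predictions: {model_name: {emotion: [(v, a, d), ...]}}
    Hypothetical layout: each model contributes VAD triples per labeled emotion.
    """
    per_emotion = {}
    for model_outputs in predictions.values():
        for emotion, triples in model_outputs.items():
            # Average within one model first so every model gets equal weight
            # (an assumption; the abstract does not specify the weighting).
            model_mean = tuple(mean(t[i] for t in triples) for i in range(3))
            per_emotion.setdefault(emotion, []).append(model_mean)
    # Then average the per-model means across models.
    return {
        emotion: tuple(mean(m[i] for m in means) for i in range(3))
        for emotion, means in per_emotion.items()
    }


def nearest_emotion(vad, mapping):
    """Return the emotion whose mapped VAD point is closest (squared Euclidean)."""
    return min(
        mapping,
        key=lambda e: sum((x - y) ** 2 for x, y in zip(vad, mapping[e])),
    )


# Made-up values for illustration only; they are not results from the paper.
toy_predictions = {
    "model_a": {
        "happy": [(1.0, 0.5, 0.5), (0.5, 0.5, 0.5)],
        "sad": [(0.25, 0.25, 0.5)],
    },
    "model_b": {
        "happy": [(0.75, 1.0, 0.5)],
        "sad": [(0.25, 0.25, 0.0)],
    },
}
mapping = average_vad_mapping(toy_predictions)
```

A predicted VAD triple, such as the three-value output of the SER model described above, could then be turned back into a categorical label with `nearest_emotion(predicted_vad, mapping)`. Whether the paper uses this lookup is not stated in the abstract.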