Deep Learning Based Sign Language Recognition Using Efficient Multi-Feature Attention Mechanism


Yenisari E., YAVUZ S.

IEEE Access, vol. 13, pp. 126684-126699, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 13
  • Publication Date: 2025
  • DOI: 10.1109/access.2025.3586096
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp. 126684-126699
  • Keywords: Attention mechanism, computer vision, deep learning, sign language recognition, SLR datasets, vision-based recognition
  • Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

Sign language is a communication system used by hearing-impaired people and serves as a bridge between the hearing and deaf communities. Since sign language involves numerous visuomotor elements spanning both visual perception (hand shapes, facial expressions) and physical movement (hand and arm motion), it represents a multimodal input source for Sign Language Recognition (SLR) systems. In this study, a new deep learning-based architecture using EfficientNet and a multi-feature attention mechanism is proposed to accurately recognize sign language gestures. Initially, general visual features are acquired through the EfficientNet model, leveraging the transfer learning paradigm. Subsequently, dataset-specific contextual features are extracted using distinct network types: spatial dependencies are modeled via Convolutional Neural Networks (CNNs), while temporal dynamics are learned through Recurrent Neural Networks (RNNs). These features are adaptively weighted by an attention mechanism that focuses on the information most critical to the classification task. This approach ensures that the most information-rich and useful components of both methods are emphasized, leading to a significant increase in final classification performance. Using RGB video images, the proposed model achieved accuracies of 99.21% and 96.84% on the 50-word and 174-word class subsets, respectively, of the BosphorusSign22k General dataset of Turkish Sign Language (TSL) words. Furthermore, the generalization ability of the model is demonstrated by its high accuracy of 99.94% on the Argentinian Sign Language dataset (LSA64) and 98.41% on the Indian Sign Language dataset (INCLUDE50). Experimental results indicate that the proposed architecture is competitive with existing SLR models reviewed in the literature.
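To illustrate the fusion step the abstract describes — adaptively weighting the CNN (spatial) and RNN (temporal) feature streams with an attention mechanism — here is a minimal pure-Python sketch. This is not the authors' implementation; the function name, the use of fixed relevance scores, and the softmax weighting are illustrative assumptions (in the actual network the scores would be learned end-to-end):

```python
import math

def attention_fuse(spatial_feat, temporal_feat, scores):
    """Fuse two feature vectors by softmax-weighted summation.

    spatial_feat / temporal_feat: equal-length lists of floats standing in
    for the CNN (spatial) and RNN (temporal) branch outputs. `scores` is a
    pair of raw relevance scores, one per stream; in a trained model these
    would be produced by learned attention layers. All names here are
    hypothetical, chosen only for this sketch.
    """
    m = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]         # attention weights, sum to 1

    # Weighted sum of the two streams: the stream with the higher score
    # contributes more to the fused representation.
    fused = [alpha[0] * s + alpha[1] * t
             for s, t in zip(spatial_feat, temporal_feat)]
    return fused, alpha

# Toy example: the spatial stream receives a higher raw score,
# so it dominates the fused feature vector.
fused, alpha = attention_fuse([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 1.0])
```

The softmax makes the weighting adaptive: as one stream's score grows, its weight approaches 1 while the other's approaches 0, which matches the abstract's idea of emphasizing the most information-rich branch per input.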