Turkish Scene Text Recognition with a Lightweight and Robust Transformer (Hafif ve Gürbüz Dönüştürücü ile Türkçe Sahne Metni Tanıma)

Yildiz S.

33rd IEEE Conference on Signal Processing and Communications Applications, SIU 2025, İstanbul, Türkiye, 25-28 June 2025 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/siu66497.2025.11111830
City of Publication: İstanbul
Country of Publication: Türkiye
Keywords: optical character recognition, scene text recognition, vision transformer
Affiliated with Yıldız Technical University: Yes

Abstract

In this study, we propose two lightweight vision transformers, ViT-TR-Tiny and ViT-TR-Nano, for scene text recognition. By significantly reducing overall network complexity, these models balance recognition accuracy against computational efficiency. Experimental results show that the proposed models reach competitive word accuracy, with only minor degradation compared to well-known approaches in the literature. Notably, the TensorRT-optimized ViT-TR-Tiny achieves 93.44% word accuracy on STRIT and 92.78% on TS-TR while processing 2264 images per second. These findings highlight the promise of efficient transformer-based architectures for complex scene text recognition tasks, particularly in Turkish.