Turkish Scene Text Recognition with a Lightweight and Robust Transformer (Hafif ve Gürbüz Dönüştürücü ile Türkçe Sahne Metni Tanıma)


Yildiz S.

33rd IEEE Conference on Signal Processing and Communications Applications, SIU 2025, İstanbul, Turkey, 25 - 28 June 2025, (Full Text)

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/siu66497.2025.11111830
  • City: İstanbul
  • Country: Turkey
  • Keywords: optical character recognition, scene text recognition, vision transformer
  • Yıldız Technical University Affiliated: Yes

Abstract

In this study, we propose two lightweight vision transformers, ViT-TR-Tiny and ViT-TR-Nano, for scene text recognition. These models achieve a favorable balance between recognition accuracy and computational efficiency by significantly reducing overall network complexity. Experimental results show that the proposed models achieve word accuracy competitive with well-known approaches in the literature, with only minor degradation. Notably, the TensorRT-optimized ViT-TR-Tiny achieved 93.44% word accuracy on STRIT and 92.78% on TS-TR while processing 2264 images per second. These findings highlight the promise of efficient transformer-based architectures for tackling complex scene text recognition tasks, particularly in Turkish.
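
The two figures quoted in the abstract, word accuracy and images per second, can be illustrated with a minimal sketch. It assumes word accuracy means the fraction of images whose predicted string matches the ground truth exactly; the abstract does not state the evaluation protocol (case handling, normalization), so that definition is an assumption. The function names and the toy data are hypothetical and are not taken from STRIT or TS-TR.

```python
def word_accuracy(predictions, references):
    """Fraction of samples whose predicted word equals the reference exactly.

    Assumption: exact string match, with no case folding or normalization.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def ms_per_image(images_per_second):
    """Average time per image (in milliseconds) implied by a throughput figure."""
    return 1000.0 / images_per_second


# Hypothetical toy data: the last prediction confuses "ş" with "s".
preds = ["çiçek", "İstanbul", "okul", "sokak"]
refs = ["çiçek", "İstanbul", "okul", "şokak"]

print(word_accuracy(preds, refs))     # 0.75
print(round(ms_per_image(2264), 3))   # 0.442
```

Under this reading, 2264 images per second corresponds to roughly 0.44 ms per image on average. That is an average over the measured run and is not the same as single-image latency if the images were processed in batches.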