Turkish Image Captioning with Vision Transformer Based Encoders and Text Decoders (Görü Dönüştürücü Tabanlı Kodlayıcılar ve Metin Kod Çözücüler ile Türkçe Görüntü Altyazılama)


Yıldız S., MEMİŞ A., VARLI S.

32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/siu61531.2024.10600738
  • City of Publication: Mersin
  • Country of Publication: Türkiye
  • Keywords: image captioning, image understanding, text decoders, Turkish image captioning, vision transformers
  • Affiliated with Yıldız Technical University: Yes

Abstract

Image captioning is the task of having a computer system automatically describe images, so that the visual content of an image is expressed in textual form. This paper presents a deep learning-based Turkish image captioning study that uses vision transformers and text decoders. In the proposed approach, images are first encoded with a vision transformer-based module. The encoded image features are then normalized by passing them through a feature projection module. In the final stage, image captions are generated by a text decoder block. The system was evaluated on TasvirEt, a benchmark dataset of Turkish image captions. In these tests it obtained a BLEU-1 score of 0.3406, a BLEU-2 score of 0.2110, a BLEU-3 score of 0.1253, a BLEU-4 score of 0.0690, a METEOR score of 0.1610, a ROUGE-L score of 0.3145, and a CIDEr score of 0.3879.
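To make the BLEU-n figures in the abstract easier to interpret, the following is a minimal sketch of how a BLEU-n score can be computed for a single caption. It is an illustration only: the abstract does not say which evaluation toolkit, tokenization, or smoothing was used, and reported scores are normally aggregated over a whole test set rather than computed per sentence. The function names and the toy Turkish sentences are hypothetical and are not taken from TasvirEt.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) that occur in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the highest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter one.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Unsmoothed sentence-level BLEU-max_n with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Toy example (hypothetical sentences, whitespace tokenization assumed).
candidate = "bir köpek parkta koşuyor".split()
references = [
    "bir köpek çimlerde koşuyor".split(),
    "kahverengi bir köpek çimlerin üzerinde koşuyor".split(),
]
for n in range(1, 4):
    print(f"BLEU-{n}: {bleu(candidate, references, n):.4f}")
```

In this toy case three of the four candidate words appear in a reference, but only one of three bigrams and no trigram does, so the score falls quickly as n grows. The same effect explains why the reported BLEU-4 is much lower than the reported BLEU-1.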