32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024
Image captioning is the task of automatically describing the content of an image, so that visual information is expressed in textual form. This paper presents a deep learning-based Turkish image captioning study built from a vision transformer and a text decoder. In the proposed system, images are first encoded with a vision transformer-based module. The encoded image features are then normalized by passing them through a feature projection module. In the final stage, image captions are generated by a text decoder block. The performance of the proposed Turkish image captioning system was evaluated on TasvirEt, a benchmark dataset of Turkish image captions. The experiments yielded promising results: a BLEU-1 score of 0.3406, BLEU-2 of 0.2110, BLEU-3 of 0.1253, BLEU-4 of 0.0690, METEOR of 0.1610, ROUGE-L of 0.3145, and CIDEr of 0.3879.
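The three-stage pipeline described above (vision transformer encoder → feature projection → text decoder) can be sketched in a minimal, self-contained form. This is not the paper's implementation: the dimensions, the random stand-in weights, the layer-norm-style projection, and the greedy decoding loop are all illustrative assumptions chosen only to show how the stages connect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 196 image patches,
# encoder width 768, decoder width 512, vocabulary of 1000 tokens.
NUM_PATCHES, ENC_DIM, DEC_DIM, VOCAB = 196, 768, 512, 1000
BOS, EOS = 0, 1  # assumed special-token ids

def vit_encode(image_patches):
    """Stand-in for the vision-transformer encoder: a fixed random
    linear map over flattened patches (real ViT layers omitted)."""
    W = rng.standard_normal((image_patches.shape[-1], ENC_DIM)) * 0.02
    return image_patches @ W                            # (NUM_PATCHES, ENC_DIM)

def feature_projection(features):
    """Normalize encoder features and project them to the decoder width,
    mirroring the paper's feature projection module (details assumed)."""
    mu = features.mean(-1, keepdims=True)
    sigma = features.std(-1, keepdims=True)
    normed = (features - mu) / (sigma + 1e-6)           # layer-norm-style
    W = rng.standard_normal((ENC_DIM, DEC_DIM)) * 0.02
    return normed @ W                                   # (NUM_PATCHES, DEC_DIM)

def decode_caption(visual_ctx, max_len=20):
    """Toy greedy text decoder: scores the vocabulary from the pooled
    visual context plus the previous token's embedding."""
    E = rng.standard_normal((VOCAB, DEC_DIM)) * 0.02    # token embeddings
    W_out = rng.standard_normal((DEC_DIM, VOCAB)) * 0.02
    pooled = visual_ctx.mean(axis=0)                    # (DEC_DIM,)
    tokens = [BOS]
    for _ in range(max_len):
        h = pooled + E[tokens[-1]]
        nxt = int(np.argmax(h @ W_out))                 # greedy choice
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens

# One dummy image: 196 flattened 32x32x3 patches.
patches = rng.standard_normal((NUM_PATCHES, 32 * 32 * 3))
caption_ids = decode_caption(feature_projection(vit_encode(patches)))
print(caption_ids[:5])
```

In a trained system, the random matrices would be learned parameters and the decoder would attend to the projected patch features at every step; the sketch only fixes the data flow between the three modules the abstract names.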