TRCaptionNet++: A high-performance encoder-decoder based deep Turkish image captioning model fine-tuned with a large-scale set of pretrain data


Yıldız S., Memiş A., Varlı S.

Turkish Journal of Electrical Engineering and Computer Sciences, vol. 33, no. 5, pp. 669-687, 2025 (SCI-Expanded, Scopus, TRDizin)

  • Publication Type: Article / Full Article
  • Volume: 33 Issue: 5
  • Publication Date: 2025
  • DOI: 10.55730/1300-0632.4150
  • Journal Name: Turkish Journal of Electrical Engineering and Computer Sciences
  • Indexed in: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Compendex, INSPEC, TR DİZİN (ULAKBİM)
  • Page Numbers: pp. 669-687
  • Keywords: deep learning, image captioning, image encoders, text decoders, Turkish image captioning
  • Yıldız Technical University Affiliated: Yes

Abstract

This paper introduces TRCaptionNet++, a novel high-performance encoder-decoder-based deep model for generic Turkish image captioning. The proposed model is an improved and refined version of TRCaptionNet, which essentially employs a CLIP (contrastive language-image pretraining) image encoder, a feature projection layer, and a BERT (bidirectional encoder representations from transformers) text decoder. Within the scope of the study, the regular TRCaptionNet model was trained and then specifically fine-tuned with a massive set of image data. To this end, approximately 2,000,000 random images representing the words in the MS COCO and Flickr caption sets were retrieved through web crawling in the initial stage. Then, a total of nearly 8,000,000 caption texts were generated for these images using four different image captioning models. Finally, the text decoder module of the proposed model was improved using the image-caption features of the crawled images. The performance of TRCaptionNet++ was evaluated on two Turkish caption datasets (TasvirEt and Turkish MS COCO) and two machine-translated caption sets (MS COCO and Flickr30K) by measuring common image captioning metrics such as BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The performance tests yielded quite remarkable captioning success rates, and the proposed model was observed to outperform all related works. Project details and demo links for TRCaptionNet++ will also be available on the project's page: https://serdaryildiz.com/TRCaptionNetpp.
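The evaluation described above relies on n-gram overlap metrics such as BLEU. As an illustration only (this is not the authors' evaluation code, which presumably uses standard toolkits), the sketch below shows a minimal smoothed sentence-level BLEU in plain Python; the Turkish example captions are hypothetical:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Smoothed sentence-level BLEU with uniform n-gram weights.

    Uses clipped (modified) n-gram precision and a brevity penalty,
    with simple smoothing so a missing higher-order match does not
    zero out the whole score.
    """
    ref, cand = reference.split(), candidate.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p = overlap / total if overlap > 0 else 1.0 / (2 * total)
        log_precisions.append(math.log(p))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


# hypothetical Turkish captions ("a cat is sleeping on the carpet")
ref = "bir kedi halının üzerinde uyuyor"
print(sentence_bleu(ref, ref))                       # identical: 1.0
print(sentence_bleu(ref, "bir köpek bahçede koşuyor"))  # little overlap: near 0
```

CIDEr and SPICE reward different aspects (TF-IDF-weighted consensus and scene-graph matching, respectively), which is why the paper reports several metrics side by side.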