TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders

Yıldız S., Memiş A., Varlı S.

Turkish Journal of Electrical Engineering and Computer Sciences, vol.31, no.6, pp.1079-1098, 2023 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 31 Issue: 6
  • Publication Date: 2023
  • DOI Number: 10.55730/1300-0632.4035
  • Journal Name: Turkish Journal of Electrical Engineering and Computer Sciences
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Compendex, INSPEC, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.1079-1098
  • Keywords: bidirectional encoder representations from transformers, contrastive language–image pretraining, image captioning, image understanding, natural language processing, Turkish image captioning
  • Yıldız Technical University Affiliated: Yes


Image captioning is a fundamental computer vision task that aims to understand and describe what is happening in an image or image region. Through image captioning, the actions of the objects in an image and the relations between them are described, so that the contents of images can be understood and interpreted automatically by visual computing systems. In this paper, we propose TRCaptionNet, a novel deep learning-based Turkish image captioning (TIC) model for the automatic generation of Turkish captions. The proposed model consists of a basic image encoder, a feature projection module based on vision transformers, and a text decoder. In the first stage, the system encodes the input images with the CLIP (contrastive language–image pretraining) image encoder. The CLIP image features are then passed through a vision transformer to obtain the final image features, which are linked with the textual features. In the last stage, a deep text decoder built on a BERT (bidirectional encoder representations from transformers) based model generates the image captions. Furthermore, unlike related works, a neural machine translation model, NLLB (No Language Left Behind), was employed to produce Turkish captions from the original English captions. Extensive performance evaluations were carried out, and widely used image captioning metrics such as BLEU, METEOR, ROUGE-L, and CIDEr were measured for the proposed model. In the experiments, highly successful results were observed on MS COCO and Flickr30K, two prominent datasets in this field. A comparative analysis against existing reports in the current TIC literature shows that the proposed model achieves superior performance and outperforms the related works on TIC to date.
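The three-stage dataflow described in the abstract (CLIP image encoder → vision-transformer projection → BERT-based text decoder) can be sketched as follows. This is a minimal NumPy illustration of the shapes and the greedy decoding loop only: the dimensions, the function names, and the random weights standing in for the pretrained CLIP, ViT, and BERT components are all illustrative assumptions, not the actual TRCaptionNet implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the actual TRCaptionNet sizes are not given in the abstract.
D_CLIP, D_PROJ, VOCAB, MAX_LEN = 512, 768, 1000, 20

def clip_image_encoder(image):
    """Stand-in for the CLIP image encoder: image -> global feature vector."""
    return rng.standard_normal(D_CLIP)

def vit_projection(clip_feat, W):
    """Stand-in for the vision-transformer projection module that maps
    CLIP image features into the text decoder's feature space."""
    return np.tanh(W @ clip_feat)

def bert_decoder_step(visual_feat, prev_tokens, W_out):
    """Stand-in for one greedy step of the BERT-based text decoder.
    A real decoder would attend over the visual features and the
    previously generated tokens; here we only project to vocabulary logits."""
    logits = W_out @ visual_feat
    return int(np.argmax(logits))

# Random weights in place of the pretrained parameters.
W_proj = rng.standard_normal((D_PROJ, D_CLIP)) / np.sqrt(D_CLIP)
W_out = rng.standard_normal((VOCAB, D_PROJ)) / np.sqrt(D_PROJ)

def generate_caption(image):
    """Encode the image, project the features, then decode token ids greedily."""
    feat = vit_projection(clip_image_encoder(image), W_proj)
    tokens = []
    for _ in range(MAX_LEN):
        tokens.append(bert_decoder_step(feat, tokens, W_out))
    return tokens

caption_ids = generate_caption(None)  # token ids to be detokenized into a Turkish caption
```

In the full system, the decoder's training captions would first be translated from English to Turkish with NLLB, so the decoder learns to emit Turkish token sequences directly.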
Project details and demo links of TRCaptionNet will also be available on the project’s GitHub page (