32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024
Image captioning is the task of automatically describing the content of an image, so that visual information is expressed in textual form. This paper presents a deep learning-based Turkish image captioning study built from a vision transformer and a text decoder. In the proposed system, images are first encoded with a vision transformer-based module. The encoded image features are then normalized by passing them through a feature projection module. In the final stage, image captions are generated by a text decoder block. The performance of the proposed Turkish image captioning system was evaluated on TasvirEt, a benchmark dataset of Turkish image captions. The experiments yielded promising results: a BLEU-1 score of 0.3406, BLEU-2 of 0.2110, BLEU-3 of 0.1253, BLEU-4 of 0.0690, METEOR of 0.1610, ROUGE-L of 0.3145, and CIDEr of 0.3879.
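The three-stage pipeline described above (vision transformer encoder → feature projection → text decoder) can be sketched in a minimal, self-contained form. This is not the paper's implementation: the dimensions, the random stand-in weights, the layer-norm-style projection, and the greedy decoding loop are all illustrative assumptions chosen only to show how the stages connect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 196 image patches,
# encoder width 768, decoder width 512, vocabulary of 1000 tokens.
NUM_PATCHES, ENC_DIM, DEC_DIM, VOCAB = 196, 768, 512, 1000
BOS, EOS = 0, 1  # assumed special-token ids

def vit_encode(image_patches):
    """Stand-in for the vision-transformer encoder: a fixed random
    linear map over flattened patches (real ViT layers omitted)."""
    W = rng.standard_normal((image_patches.shape[-1], ENC_DIM)) * 0.02
    return image_patches @ W                            # (NUM_PATCHES, ENC_DIM)

def feature_projection(features):
    """Normalize encoder features and project them to the decoder width,
    mirroring the paper's feature projection module (details assumed)."""
    mu = features.mean(-1, keepdims=True)
    sigma = features.std(-1, keepdims=True)
    normed = (features - mu) / (sigma + 1e-6)           # layer-norm-style
    W = rng.standard_normal((ENC_DIM, DEC_DIM)) * 0.02
    return normed @ W                                   # (NUM_PATCHES, DEC_DIM)

def decode_caption(visual_ctx, max_len=20):
    """Toy greedy text decoder: scores the vocabulary from the pooled
    visual context plus the previous token's embedding."""
    E = rng.standard_normal((VOCAB, DEC_DIM)) * 0.02    # token embeddings
    W_out = rng.standard_normal((DEC_DIM, VOCAB)) * 0.02
    pooled = visual_ctx.mean(axis=0)                    # (DEC_DIM,)
    tokens = [BOS]
    for _ in range(max_len):
        h = pooled + E[tokens[-1]]
        nxt = int(np.argmax(h @ W_out))                 # greedy choice
        tokens.append(nxt)
        if nxt == EOS:
            break
    return tokens

# One dummy image: 196 flattened 32x32x3 patches.
patches = rng.standard_normal((NUM_PATCHES, 32 * 32 * 3))
caption_ids = decode_caption(feature_projection(vit_encode(patches)))
print(caption_ids[:5])
```

In a trained system, the random matrices would be learned parameters and the decoder would attend to the projected patch features at every step; the sketch only fixes the data flow between the three modules the abstract names.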