Improving Information Retrieval in Turkish: A ColBERT-Based Approach


Saoud A., Kazerooni P., AMASYALI M. F., KESGİN H. T.

3rd International Congress of Electrical and Computer Engineering, ICECENG 2024, Bandirma, Turkey, 27 - 30 November 2024, pp.85-91, (Full Text)

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1007/978-3-031-88999-8_7
  • City: Bandirma
  • Country: Turkey
  • Page Numbers: pp.85-91
  • Keywords: BERT, ColBERT, Deep learning, Machine learning, Monolingual models, Natural language processing
  • Yıldız Technical University Affiliated: Yes

Abstract

This study advances Turkish information retrieval by developing and fine-tuning monolingual Cosmos BERT models within the ColBERT architecture. Using a translated version of the MS MARCO dataset, we created and evaluated Tiny and Base Cosmos BERT models tailored specifically for Turkish. The research addresses challenges that Turkish poses for multilingual models, such as tokenization limitations and reduced retrieval performance. Performance comparisons against widely used multilingual models reveal substantial improvements, particularly in Recall and Mean Reciprocal Rank (MRR), emphasizing the effectiveness of monolingual models for complex languages like Turkish. This work highlights the advantages of adopting monolingual retrieval-augmented generation systems, providing both academic and industrial communities with a tool for natural language processing applications in Turkish. The results underscore the importance of language-specific models for precise and relevant information retrieval in non-English languages.
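For readers unfamiliar with the terms used in the abstract, the following is a minimal sketch of ColBERT-style late-interaction scoring (often called MaxSim) and of the two reported metrics, Recall@k and reciprocal rank (MRR is the mean of reciprocal ranks over all queries). It is not the authors' implementation: the function names, the two-dimensional toy embeddings, and the document IDs are all hypothetical, and a real system would obtain token embeddings from a fine-tuned BERT encoder.

```python
from typing import Dict, List, Sequence, Set

Matrix = Sequence[Sequence[float]]  # one embedding vector per token


def maxsim_score(query_vecs: Matrix, doc_vecs: Matrix) -> float:
    """Late interaction: for each query token, take the maximum dot product
    over all document tokens, then sum these maxima over the query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical toy data: 2-dimensional token embeddings (assumption).
query = [[1.0, 0.0], [0.0, 1.0]]
docs: Dict[str, Matrix] = {
    "d1": [[1.0, 0.0], [0.0, 0.5]],
    "d2": [[0.5, 0.5]],
    "d3": [[0.0, 1.0], [1.0, 0.0]],
}

# Rank documents by descending MaxSim score.
ranking = sorted(docs, key=lambda d: maxsim_score(query, docs[d]), reverse=True)
relevant = {"d1"}

print(ranking)
print(recall_at_k(ranking, relevant, k=2))
print(reciprocal_rank(ranking, relevant, k=10))
```

In this toy setup the single relevant document is ranked second, so its reciprocal rank is 1/2. Averaging such values over a full query set gives MRR.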