Improving Information Retrieval in Turkish: A ColBERT-Based Approach

Saoud A., Kazerooni P., AMASYALI M. F., KESGİN H. T.

3rd International Congress of Electrical and Computer Engineering, ICECENG 2024, Bandirma, Türkiye, 27-30 November 2024, pp. 85-91 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1007/978-3-031-88999-8_7
City of Publication: Bandirma
Country of Publication: Türkiye
Pages: pp. 85-91
Keywords: BERT, ColBERT, Deep learning, Machine learning, Monolingual models, Natural language processing
Affiliated with Yıldız Technical University: Yes

Abstract

This study advances Turkish information retrieval by developing and fine-tuning monolingual Cosmos BERT models within the ColBERT architecture. Using a translated version of the MS MARCO dataset, we created and evaluated Tiny and Base Cosmos BERT models tailored specifically for Turkish. The research addresses challenges specific to Turkish, such as tokenization limitations and the reduced performance of multilingual models. Performance comparisons against widely used multilingual models reveal substantial improvements, particularly in Recall and Mean Reciprocal Rank (MRR), underscoring the effectiveness of monolingual models for complex languages like Turkish. This work highlights the advantages of adopting monolingual retrieval-augmented generation systems, providing both academic and industrial communities with a powerful tool for natural language processing applications in Turkish. The results underscore the importance of language-specific models for precise and relevant information retrieval in non-English languages.
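
For readers unfamiliar with the terms used in the abstract, the following is a minimal sketch of ColBERT-style late-interaction scoring (often called MaxSim) and of the two reported metrics, Recall and MRR. It is not the authors' code: the function names, the toy vectors, and the cutoff `k` are illustrative assumptions, and the abstract does not state which cutoffs the paper used.

```python
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take the maximum dot product with any document token embedding,
    then sum those maxima over all query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """Average reciprocal rank over all queries in `runs` (hypothetical
    layout: query id -> ranked document ids; qrels: query id -> relevant ids)."""
    return sum(
        reciprocal_rank(ranked, qrels.get(qid, set()), k)
        for qid, ranked in runs.items()
    ) / len(runs)


if __name__ == "__main__":
    # Toy 2-dimensional token embeddings, invented for illustration only.
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query, doc))

    runs = {"q1": ["a", "b"], "q2": ["c", "d"]}
    qrels = {"q1": {"b"}, "q2": {"c"}}
    print(recall_at_k(runs["q1"], qrels["q1"], k=2))
    print(mean_reciprocal_rank(runs, qrels, k=10))
```

In a real ColBERT system the token embeddings come from the fine-tuned BERT encoder and are typically L2-normalized, so the dot product acts as cosine similarity; the plain lists here only stand in for those vectors.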