Turkish-e5: E5 Model Enhanced for Turkish with Multi-Positive Contrastive Learning (Turkish-e5: Çoklu Pozitif Örneklemeli Karşıtsal Öğrenme ile Türkçesi Güçlendirilmiş E5 Modeli)

Izdas T., Sancak O., KESGİN H. T., YÜCE M. K., AMASYALI M. F.

33rd IEEE Conference on Signal Processing and Communications Applications, SIU 2025, İstanbul, Türkiye, 25 - 28 June 2025 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI Number: 10.1109/siu66497.2025.11112375
City of Publication: İstanbul
Country of Publication: Türkiye
Keywords: contrastive learning, embedding model, MTEB, multi-positive sampling, retrieval
Affiliated with Yıldız Technical University: Yes

Abstract

In this study, the multilingual embedding model intfloat/multilingual-e5-large-instruct was fine-tuned for Turkish retrieval tasks using a multi-positive sampling approach. In conventional fine-tuning, each query is typically associated with a single correct answer (positive sample). Here, a dataset was instead constructed in which each query is paired with two positives: its direct answer and the contextual passage containing that answer. This lets the model learn to retrieve both direct responses and relevant information embedded in context. The model was evaluated on MTEB benchmark tasks and on examples drawn from three independent datasets. Experimental results indicate a significant improvement in the retrieval performance of the fine-tuned model. Notably, a substantial increase was observed in the R@1 metric, which measures the rate at which the best answer is ranked first, along with significant gains in MTEB results.
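As a rough illustration of the two ideas in the abstract, the sketch below shows one common way to write a contrastive loss with several positives per query, together with an R@1 computation. The abstract does not state the exact loss, temperature, or batching used in the paper, so everything here is an assumption: the loss averages the InfoNCE log-probability over all positives of a query, the temperature of 0.05 is arbitrary, and the function names (`multi_positive_info_nce`, `recall_at_1`) and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_positive_info_nce(query, candidates, positive_ids, temperature=0.05):
    """InfoNCE-style loss averaged over every positive of one query.

    Assumed formulation (not taken from the paper): softmax over all
    candidates, then the mean negative log-probability of the positives.
    """
    logits = [cosine(query, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    log_probs = [logits[i] - log_z for i in positive_ids]
    return -sum(log_probs) / len(log_probs)


def recall_at_1(queries, candidates, positive_ids_per_query):
    """Fraction of queries whose top-ranked candidate is one of its positives."""
    hits = 0
    for query, positives in zip(queries, positive_ids_per_query):
        scores = [cosine(query, c) for c in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        if best in positives:
            hits += 1
    return hits / len(queries)


# Toy embeddings: index 0 plays the direct answer, index 1 the context
# passage containing it, indices 2 and 3 are unrelated negatives.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [-1.0, 0.2]]

loss = multi_positive_info_nce(query, candidates, positive_ids=[0, 1])
r_at_1 = recall_at_1([query], candidates, [{0, 1}])
print(f"loss={loss:.4f}  R@1={r_at_1:.2f}")
```

In an actual fine-tuning run the embeddings would come from the model and the negatives would typically be other in-batch passages; the pure-Python version above only makes the arithmetic visible.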