A Turkish Dataset and BERTurk-Contrastive Model for Semantic Textual Similarity


Dehghan S., AMASYALI M. F.

Journal of Information Systems and Telecommunication, vol. 13, no. 1, pp. 24-32, 2025 (Scopus)

  • Publication Type: Article / Full Article
  • Volume: 13 Issue: 1
  • Publication Date: 2025
  • Journal Name: Journal of Information Systems and Telecommunication
  • Journal Indexes: Scopus
  • Page Numbers: pp. 24-32
  • Keywords: BERT, BERTurk, Contrastive Learning, Deep Learning, Semantic Textual Similarity, Turkish Language
  • Yıldız Technical University Affiliated: Yes

Abstract

Semantic Textual Similarity (STS) is an important NLP task that measures the degree of semantic equivalence between two texts, even when the sentence pairs are worded differently. While extensively studied in English, STS has received limited attention in Turkish. This study introduces BERTurk-contrastive, a novel BERT-based model that leverages contrastive learning to improve STS performance in Turkish. Our model learns representations by pulling similar sentences closer together in the embedding space while pushing dissimilar ones farther apart. To support this task, we release SICK-tr, a new Turkish STS dataset created by translating the English SICK dataset. We evaluate our model on STSb-tr and SICK-tr, achieving a significant improvement of 5.92 points over previous models. These results establish BERTurk-contrastive as a robust solution for STS in Turkish and provide a new benchmark for future research.
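To illustrate the contrastive objective described in the abstract, the following is a minimal sketch, not the authors' released code: an in-batch InfoNCE-style loss over Turkish sentence pairs, assuming the public BERTurk checkpoint dbmdz/bert-base-turkish-cased, PyTorch, and the Hugging Face transformers library. The mean pooling, temperature value, and example sentences are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only (assumptions noted above), not the paper's implementation.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "dbmdz/bert-base-turkish-cased"  # public BERTurk checkpoint (assumed base encoder)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID)

def embed(sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def contrastive_loss(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: each anchor's matching positive is the target;
    every other positive in the batch acts as a negative, so similar pairs
    are pulled together and dissimilar ones pushed apart."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    logits = a @ p.T / temperature                 # pairwise cosine similarities
    labels = torch.arange(len(anchors))            # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage with two hypothetical Turkish paraphrase pairs
loss = contrastive_loss(
    ["Bir adam gitar çalıyor.", "Çocuk parkta koşuyor."],
    ["Bir adam gitar çalmaktadır.", "Bir çocuk parkta koşmaktadır."],
)
loss.backward()
```

In this kind of setup, the trained encoder is typically evaluated on STS benchmarks such as STSb-tr and SICK-tr by correlating cosine similarities of sentence embeddings with the human similarity scores.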