A Turkish Dataset and BERTurk-Contrastive Model for Semantic Textual Similarity


Dehghan S., AMASYALI M. F.

Journal of Information Systems and Telecommunication, vol.13, no.1, pp.24-32, 2025 (Scopus)

  • Publication Type: Article
  • Volume: 13 Issue: 1
  • Publication Date: 2025
  • Journal Name: Journal of Information Systems and Telecommunication
  • Journal Indexes: Scopus
  • Page Numbers: pp.24-32
  • Keywords: BERT, BERTurk, Contrastive Learning, Deep Learning, Semantic Textual Similarity, Turkish Language
  • Yıldız Technical University Affiliated: Yes

Abstract

Semantic Textual Similarity (STS) is an important NLP task that measures the degree of semantic equivalence between two texts, even when they share few or no words. While STS has been studied extensively in English, it has received limited attention in Turkish. This study introduces BERTurk-contrastive, a novel BERT-based model that uses contrastive learning to improve STS in Turkish. Our model learns sentence representations by pulling similar sentences closer together in the embedding space while pushing dissimilar ones farther apart. To support this task, we release SICK-tr, a new Turkish STS dataset created by translating the English SICK dataset. We evaluate our model on STSb-tr and SICK-tr, achieving a significant improvement of 5.92 points over previous models. These results establish BERTurk-contrastive as a robust solution for STS in Turkish and provide a new benchmark for future research.
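
The abstract does not state the exact training objective, so the following is only a minimal sketch of one common contrastive formulation: an InfoNCE-style loss with in-batch negatives. The use of cosine similarity, the temperature value, the function names `cosine` and `info_nce_loss`, and the toy 2-d vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive (InfoNCE-style) loss; an assumed objective.

    anchors[i] and positives[i] embed a similar sentence pair; every
    positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching positive as the target.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy 2-d embeddings (hypothetical values, not model outputs).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each positive lies near its anchor
misaligned = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped between anchors

loss_aligned = info_nce_loss(anchors, aligned)
loss_misaligned = info_nce_loss(anchors, misaligned)
```

In this sketch the loss is close to zero when each sentence sits near its paired sentence and far from the others, and it grows when the pairs are mismatched. Minimizing such a loss is one way to obtain the "closer together, farther apart" behavior the abstract describes.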