Creating a Large Clean Web Corpus for Turkish


Uzun M. E., Erdem Y., Izdas T., Sancak O., Zeer A., Ince E., et al.

4th IEEE International Conference on Computing and Machine Intelligence, ICMI 2025, Michigan, United States of America, 5-6 April 2025

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/icmi65310.2025.11141276
  • City: Michigan
  • Country: United States of America
  • Keywords: Corpus Construction, Data Cleaning, Large Language Models (LLM), Text Preprocessing, Web Data Filtering
  • Yıldız Technical University Affiliated: Yes

Abstract

This study aims to build a high-quality dataset that improves the performance of Turkish language models and to investigate the effect of this dataset on language model training. The irregular, context-free, and noisy structure of web-based data can degrade the performance of large language models. To address this issue, a comprehensive cleaning and filtering process was applied to the existing CulturaX dataset, using page-based and content-based filtering methods to make the data more consistent and meaningful. In addition, supervised machine learning models were trained on various feature sets, such as FastText and BERT embeddings, and the contributions of these feature sets to model performance were compared. Experimental results show that the model combining FastText embeddings with heuristic features achieved the highest accuracy and F1 scores. Language models trained on the resulting 125 GB cleaned Turkish web corpus reached lower perplexity and higher accuracy. The corpus thus provides a solid foundation for research on more reliable and effective large language models for Turkish.
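
As an illustration of how heuristic features might be combined with document embeddings for a quality classifier, the following is a minimal sketch. The abstract does not list the actual features used, so the four signals below, the function names `heuristic_features` and `combine`, and the assumption that an embedding vector has already been computed elsewhere (for example with FastText) are all hypothetical.

```python
# Hypothetical heuristic features; the abstract does not specify the real ones.
TURKISH_LETTERS = set(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)


def heuristic_features(text: str) -> list[float]:
    """Return four document-level quality signals for a web page text."""
    chars = len(text)
    words = text.split()
    if chars == 0 or not words:
        return [0.0, 0.0, 0.0, 0.0]
    # Share of characters that are letters of the Turkish alphabet.
    letter_ratio = sum(c in TURKISH_LETTERS for c in text) / chars
    # Share of characters that are digits (high in tables, price lists, spam).
    digit_ratio = sum(c.isdigit() for c in text) / chars
    # Average whitespace-delimited token length.
    mean_word_len = sum(len(w) for w in words) / len(words)
    # Share of distinct tokens (low for repetitive boilerplate).
    unique_word_ratio = len({w.lower() for w in words}) / len(words)
    return [letter_ratio, digit_ratio, mean_word_len, unique_word_ratio]


def combine(embedding: list[float], text: str) -> list[float]:
    """Concatenate a precomputed document embedding with heuristic features."""
    return list(embedding) + heuristic_features(text)
```

The combined vector could then be passed to any supervised classifier trained on pages labeled as clean or noisy; which classifier the authors used is not stated in the abstract.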