Creating a Large Clean Web Corpus for Turkish

Uzun M. E., Erdem Y., Izdas T., Sancak O., Zeer A., Ince E., ...

4th IEEE International Conference on Computing and Machine Intelligence, ICMI 2025, Michigan, United States, 5-6 April 2025 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/icmi65310.2025.11141276
City of Publication: Michigan
Country of Publication: United States
Keywords: Corpus Construction, Data Cleaning, Large Language Models (LLM), Text Preprocessing, Web Data Filtering
Affiliated with Yıldız Technical University: Yes

Abstract

This study aims to create a high-quality dataset to improve the performance of Turkish language models and to investigate the effect of this dataset on language model training. The irregular, context-free, and noisy structure of web-based data can negatively affect the success of large language models. To address this issue, a comprehensive cleaning and filtering process was applied to the existing CulturaX dataset, using page-based and content-based filtering methods to make the dataset more consistent and meaningful. In addition, supervised machine learning models were trained on various feature sets, such as FastText and BERT embeddings, and the contributions of these feature sets to model performance were compared. Experimental results showed that the model combining FastText embeddings with heuristic features achieved the highest accuracy and F1 scores. The resulting 125 GB cleaned Turkish web corpus yielded lower perplexity and higher accuracy in language model training. The study thus provides a solid foundation for research on more reliable and effective large language models for Turkish.
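The abstract does not list the heuristic features or describe how they were joined with the embeddings, so the following is only a minimal sketch of the general idea: compute a few document-level quality signals and concatenate them with a precomputed embedding vector before passing the result to a supervised classifier. The function names, the four features chosen, and the placeholder embedding are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def heuristic_features(text: str) -> list[float]:
    """Return a few illustrative document-level quality signals (assumed, not from the paper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    words = text.split()
    n_chars = len(text)
    # Share of alphabetic characters; low values suggest markup or number-heavy noise.
    alpha_ratio = sum(ch.isalpha() for ch in text) / n_chars if n_chars else 0.0
    # Mean word length; extreme values can indicate broken text.
    mean_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    # Share of non-empty lines that repeat an earlier line (menus, footers).
    counts = Counter(lines)
    dup_ratio = sum(c - 1 for c in counts.values()) / len(lines) if lines else 0.0
    return [float(len(words)), alpha_ratio, mean_word_len, dup_ratio]


def combine_features(text: str, embedding: list[float]) -> list[float]:
    """Concatenate heuristic signals with a precomputed document embedding."""
    return heuristic_features(text) + list(embedding)


if __name__ == "__main__":
    noisy = "Ana Sayfa\nAna Sayfa\nAna Sayfa\n>>> 123 <<<"
    clean = "Bu metin düzgün yazılmış bir Türkçe paragraftır."
    fake_embedding = [0.0, 0.0, 0.0]  # placeholder, not a real FastText vector
    print(combine_features(noisy, fake_embedding))
    print(combine_features(clean, fake_embedding))
```

In practice the placeholder vector would be replaced by an actual document embedding, and the combined vectors would serve as input to whichever classifier is being evaluated.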