Knowledge-Based Systems, vol. 328, 2025 (SCI-Expanded)
Large language models (LLMs) depend on vast web-scale datasets, which frequently include noisy or low-quality samples that degrade performance and fairness even after conventional data cleaning. This paper introduces an in-training filtering approach that selectively ignores noisy data points based on real-time loss statistics gathered during training. The approach combines deterministic and probabilistic selection mechanisms, using robust loss-based metrics and cyclically adjusted thresholds to balance stability and diversity. Evaluations on Turkish-language datasets demonstrate that this strategy reduces validation loss and improves downstream task accuracy without any preprocessing. By integrating filtering directly into the training loop, the method maintains data diversity, requires minimal overhead, and improves learning efficiency, offering a scalable alternative for robust LLM pretraining in noisy or low-resource environments.
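The abstract does not give implementation details, but the general idea of loss-based in-training filtering can be sketched as follows. The snippet below is a minimal illustration, not the authors' method: it assumes per-sample losses are available at each step, uses median/MAD as the robust loss statistic, a cosine cycle for the adjustable threshold, and a fixed acceptance probability for the probabilistic branch; the function and parameter names (`keep_mask`, `cyclic_multiplier`, `accept_prob`) are hypothetical.

```python
# Hypothetical sketch of in-training, loss-based sample filtering.
# Assumptions (not from the paper): robust statistic = median/MAD,
# cyclic threshold = cosine schedule, fixed acceptance probability
# for high-loss samples to preserve diversity.
import math
import torch


def cyclic_multiplier(step: int, period: int = 1000,
                      low: float = 1.5, high: float = 3.0) -> float:
    """Cosine-cycled threshold multiplier: alternates between strict (low)
    and lenient (high) filtering phases over the training run."""
    phase = (step % period) / period
    return low + 0.5 * (high - low) * (1 - math.cos(2 * math.pi * phase))


def keep_mask(per_sample_loss: torch.Tensor, step: int,
              accept_prob: float = 0.1) -> torch.Tensor:
    """Deterministically keep samples whose loss lies within a robust band,
    and probabilistically keep a small fraction of outliers for diversity."""
    med = per_sample_loss.median()
    mad = (per_sample_loss - med).abs().median() + 1e-8    # robust scale
    threshold = med + cyclic_multiplier(step) * mad
    inliers = per_sample_loss <= threshold                  # deterministic part
    lucky = torch.rand_like(per_sample_loss) < accept_prob  # probabilistic part
    return inliers | lucky


def filtered_loss(per_sample_loss: torch.Tensor, step: int) -> torch.Tensor:
    """Average the loss over selected samples only; fall back to the full
    batch if the mask would drop everything."""
    mask = keep_mask(per_sample_loss, step)
    if mask.any():
        return per_sample_loss[mask].mean()
    return per_sample_loss.mean()


if __name__ == "__main__":
    # Toy batch: mostly clean losses plus a few noisy outliers.
    losses = torch.cat([torch.rand(30) * 2.0, torch.tensor([15.0, 20.0])])
    print(filtered_loss(losses, step=42))
```

Because the mask is computed from the current batch's loss distribution rather than a fixed pre-training pass, this kind of filter adds only a few tensor operations per step, which is consistent with the abstract's claim of minimal overhead.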