Knowledge-Based Systems, vol. 343, 2026 (SCI-Expanded, Scopus)
Transformer-based large language models are typically trained on large text corpora with the next-token cross-entropy (CE) objective. Although CE is scalable and stable, in practice it can exhibit limitations such as overconfidence, weak learning signals on hard or rare tokens, and a mismatch between the training objective and generation-time behavior. In this work, we propose Knowledge-Filtered Phase Training (KFPT), a two-phase scheme that strengthens the training signal without requiring an additional teacher model. In the first phase, KFPT augments CE with a selective regularization term (RU) and, at fixed intervals, performs a second forward pass over the same text with small blocks masked out in the attention mask, averaging the two CE losses to stabilize updates. In the second phase, KFPT adds a one-way KL-consistency term that takes the distribution from a span-drop-induced second view as its target; this term is selectively weighted and strengthened only at useful positions, based on the reference view's reliability (gold margin and correctness) and the student's uncertainty (entropy). Through mathematical theorems and proofs, we also analyze why the additional terms used in Phase 1 and Phase 2 can be effective. In comprehensive experiments, we compare KFPT against a strong CE baseline and prior teacher-free objective-improvement methods across multiple model architectures and training regimes. The results show that KFPT generally improves accuracy and reduces perplexity, outperforming teacher-free alternatives from the literature.
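The abstract does not give the exact formulation of the Phase-2 term, but its stated ingredients (a one-way KL loss toward a span-drop reference view, applied only where the reference view is correct with a clear gold margin and where the student is uncertain) can be sketched as follows. All names and thresholds here (`kfpt_phase2_loss`, `margin_tau`, `ent_tau`, `kl_weight`) are illustrative assumptions, not the paper's implementation, and NumPy stands in for a deep-learning framework; in real training the reference view would be detached from the gradient.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def kfpt_phase2_loss(student_logits, ref_logits, gold,
                     margin_tau=0.1, ent_tau=0.5, kl_weight=1.0):
    """Hypothetical sketch of a selectively gated one-way KL-consistency loss.

    student_logits, ref_logits: (T, V) per-token logits from the primary
    and span-drop views; gold: (T,) gold next-token ids.
    """
    p_s = softmax(student_logits)  # student distribution (primary view)
    p_r = softmax(ref_logits)      # reference distribution, treated as target
    T = gold.shape[0]
    idx = np.arange(T)

    # Base objective: next-token cross-entropy on the primary view.
    ce = -np.log(p_s[idx, gold] + 1e-12)

    # Reference-view reliability: correct argmax with a clear gold margin
    # (gold probability minus the best competing probability).
    ref_correct = p_r.argmax(axis=-1) == gold
    sorted_r = np.sort(p_r, axis=-1)
    best_other = np.where(ref_correct, sorted_r[:, -2], sorted_r[:, -1])
    gold_margin = p_r[idx, gold] - best_other

    # Student uncertainty: guide only high-entropy positions.
    uncertain = entropy(p_s) > ent_tau

    gate = ref_correct & (gold_margin > margin_tau) & uncertain

    # One-way KL(ref || student), applied only at gated positions.
    kl = (p_r * (np.log(p_r + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return (ce + kl_weight * gate * kl).mean()
```

With `kl_weight=0` the loss reduces to plain CE, so the consistency term is a strict add-on; the boolean gate implements the "only at useful positions" selectivity described in the abstract.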