SYMMETRY-BASEL, vol. 18, no. 4, 2026 (SCI-Expanded, Scopus)
Knowledge distillation (KD) transfers knowledge from large language models (LLMs) to smaller or similarly sized models in order to obtain efficient yet capable systems. However, performing distillation over all tokens is computationally expensive and may dilute the transfer signal. To address this limitation, Knowledge-Filtered Distillation (KFD) is introduced as a selective distillation approach in which tokens are filtered according to the divergence KL(M2 ‖ M0) between a teacher model (M2) and a base model (M0), while the student model (M1) is also derived from the same base model. Only tokens whose divergence exceeds a predefined threshold are distilled. For the selected tokens, the teacher distribution is renormalized over its Top-5 predictions, whereas the remaining tokens receive a label-ranking bonus. This conditional Top-5/bonus target design is shown theoretically to yield a lower label-focused target error than applying either Top-5 normalization alone or the bonus alone across all tokens. In addition, the KL and cross-entropy (CE) losses are balanced through a dynamically computed batch-level coefficient α. Experiments on multiple Turkish text datasets show that KFD consistently outperforms CE-only training, achieving higher accuracy with less data and shorter training time. KFD also outperforms entropy-based token selection methods and highlights the role of student initialization in effective knowledge transfer, thereby providing an efficient and scalable distillation framework for teacher-student models of equal size.
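To make the mechanism described above concrete, the following is a minimal PyTorch sketch of how the KL(M2 ‖ M0) token filter, the conditional Top-5/bonus targets, and the dynamic batch-level α weighting could fit together. The abstract does not give the exact formulas, so the threshold, bonus magnitude, temperature, and α schedule used here are illustrative assumptions, and all function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def kfd_loss(student_logits, teacher_logits, base_logits, labels,
             tau=1.0, kl_threshold=0.1, top_k=5, bonus=0.1):
    """Sketch of a KFD-style loss for one batch of N tokens.

    student_logits, teacher_logits, base_logits: (N, V) per-token logits
    from the student (M1), teacher (M2), and base (M0) models.
    labels: (N,) gold token ids. Hyperparameters are placeholders.
    """
    teacher_p = F.softmax(teacher_logits / tau, dim=-1)
    base_p = F.softmax(base_logits / tau, dim=-1)

    # Token filter: per-token KL(M2 || M0); distill only where it
    # exceeds the predefined threshold.
    kl_t_b = (teacher_p * (teacher_p.clamp_min(1e-9).log()
                           - base_p.clamp_min(1e-9).log())).sum(-1)
    selected = kl_t_b > kl_threshold

    # Selected tokens: teacher distribution renormalized over its
    # Top-5 predictions.
    topv, topi = teacher_p.topk(top_k, dim=-1)
    top5 = torch.zeros_like(teacher_p).scatter(-1, topi, topv)
    top5 = top5 / top5.sum(-1, keepdim=True)

    target = torch.zeros_like(teacher_p)
    target[selected] = top5[selected]

    # Remaining tokens: label-ranking bonus, read here as adding
    # probability mass to the gold label and renormalizing.
    rest = ~selected
    bonus_target = teacher_p.clone()
    bonus_target[rest, labels[rest]] += bonus
    bonus_target = bonus_target / bonus_target.sum(-1, keepdim=True)
    target[rest] = bonus_target[rest]

    log_q = F.log_softmax(student_logits / tau, dim=-1)
    kl_loss = F.kl_div(log_q, target, reduction="batchmean") * tau ** 2
    ce_loss = F.cross_entropy(student_logits, labels)

    # Dynamic batch-level alpha balancing KL and CE; the paper's exact
    # scheme is not given in the abstract, so this ratio is illustrative.
    alpha = (kl_loss / (kl_loss + ce_loss)).detach()
    return alpha * kl_loss + (1 - alpha) * ce_loss
```

Detaching α keeps the balancing coefficient out of the gradient, so each batch reweights the KL and CE terms without the weighting itself being optimized; other readings of the abstract (e.g., computing the KL term only on selected tokens) are equally plausible.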