On the Generalization vs Fidelity Paradox in Knowledge Distillation
- URL: http://arxiv.org/abs/2505.15442v2
- Date: Mon, 04 Aug 2025 12:03:51 GMT
- Title: On the Generalization vs Fidelity Paradox in Knowledge Distillation
- Authors: Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty,
- Abstract summary: Knowledge distillation (KD) is a technique for compressing large language models into smaller ones while preserving performance.<n>We present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters.<n>Our findings reveal that KD can improve the average performance of smaller models by up to $10%$, with a peak task specific gain of $22%$.
- Score: 20.62274005080048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to $10\%$, with a peak task specific gain of $22\%$, while providing only marginal benefits ($\sim 1.3\%$) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
Related papers
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
Correlation Matching Knowledge Distillation (CMKD) method combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIRAR-100 and ImageNet.
arXiv Detail & Related papers (2024-10-09T05:42:47Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve the knowledge distillation (KD)
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DisiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs.
arXiv Detail & Related papers (2024-02-06T11:10:35Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.<n>Most existing KD techniques rely on Kullback-Leibler (KL) divergence.<n>We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success, however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.