ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence
- URL: http://arxiv.org/abs/2505.04560v3
- Date: Tue, 03 Jun 2025 12:33:27 GMT
- Title: ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence
- Authors: Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, Qingming Huang
- Abstract summary: Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model. The core challenge in KD lies in balancing two mode-concentration effects. We propose ABKD, a generic framework with $\alpha$-$\beta$-divergence.
- Score: 89.630486749083
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.
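For readers who want the divergence family in code, the following is a minimal PyTorch sketch of a KD loss built on the generic $\alpha$-$\beta$-divergence (Cichocki-Amari form), assuming $\alpha$, $\beta$, and $\alpha+\beta$ are all nonzero. The function names, temperature handling, and default hyperparameters here are illustrative assumptions; the exact ABKD objective and its settings are those of the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def alpha_beta_divergence(p, q, alpha=1.0, beta=0.5, eps=1e-8):
    """Generic alpha-beta divergence D_AB^(alpha,beta)(p || q) between batches
    of probability vectors (Cichocki-Amari form). Assumes alpha, beta, and
    alpha + beta are all nonzero; FKLD and RKLD arise only as limits."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    ab = alpha + beta
    inner = (p.pow(alpha) * q.pow(beta)
             - (alpha / ab) * p.pow(ab)
             - (beta / ab) * q.pow(ab))
    return -(inner.sum(dim=-1) / (alpha * beta)).mean()

def abkd_style_loss(student_logits, teacher_logits, alpha=1.0, beta=0.5, T=2.0):
    """Illustrative KD loss: match temperature-softened teacher and student
    distributions under the alpha-beta divergence instead of FKLD/RKLD."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    p_student = F.softmax(student_logits / T, dim=-1)
    # The T*T rescaling follows the usual soft-target KD convention.
    return alpha_beta_divergence(p_teacher, p_student, alpha, beta) * T * T
```

Sweeping $(\alpha, \beta)$ moves the loss between FKLD-like and RKLD-like behavior, which is the trade-off between the two concentration effects described in the abstract.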
Related papers
- Biased Teacher, Balanced Student [0.0]
Long-Tailed Knowledge Distillation (LTKD) is a novel framework tailored for class-imbalanced scenarios. Experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods.
arXiv Detail & Related papers (2025-06-23T10:46:44Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (a rough sampling sketch follows this entry). We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
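As a rough illustration of the interleaved sampling described above, the sketch below assumes `student_logits_fn` and `teacher_logits_fn` map a batch of token ids to next-token logits and interprets "poorly ranked" as "outside the teacher's top-$k$"; SKD's actual acceptance criterion and training loop may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def interleaved_sample(student_logits_fn, teacher_logits_fn, prompt_ids,
                       max_new_tokens=32, top_k=25):
    """Hypothetical SKD-style interleaved sampling: the student proposes each
    next token; proposals outside the teacher's top-k are replaced by a token
    sampled from the teacher's own distribution."""
    ids = prompt_ids  # (batch, prompt_len) token ids
    for _ in range(max_new_tokens):
        s_logits = student_logits_fn(ids)  # (batch, vocab) next-token logits
        t_logits = teacher_logits_fn(ids)  # (batch, vocab)
        proposal = torch.multinomial(F.softmax(s_logits, dim=-1), 1)  # (batch, 1)
        accepted = (t_logits.topk(top_k, dim=-1).indices == proposal).any(-1, keepdim=True)
        fallback = torch.multinomial(F.softmax(t_logits, dim=-1), 1)  # teacher sample
        ids = torch.cat([ids, torch.where(accepted, proposal, fallback)], dim=-1)
    return ids  # sequences generated on-the-fly for distillation
```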
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines Pearson- and Spearman-correlation-based KD losses to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2024-10-09T05:42:47Z)
- Discriminative and Consistent Representation Distillation [6.24302896438145]
Discriminative and Consistent Distillation (DCD) employs a contrastive loss along with a consistency regularization to minimize the discrepancy between the distributions of teacher and student representations. The method introduces learnable temperature and bias parameters that adapt during training to balance these complementary objectives.
arXiv Detail & Related papers (2024-07-16T14:53:35Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs). We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of an implicit reward and reverse KL divergence. We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts [32.1016787150064]
Data-Free Knowledge Distillation (DFKD) is a promising approach for training high-performance small models for practical deployment without relying on the original training data.
Existing methods commonly avoid relying on private data by utilizing synthetic or sampled data.
This paper proposes a novel perspective with causal inference to disentangle the student models from the impact of such shifts.
arXiv Detail & Related papers (2024-03-28T16:13:22Z)
- Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions (a generic Sinkhorn sketch follows this entry).
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
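As a generic illustration of the tool SinKD builds on, the sketch below computes an entropy-regularized Sinkhorn distance between batches of teacher and student class distributions; the 0-1 cost matrix, regularization strength, and iteration count shown here are assumptions rather than SinKD's exact design.

```python
import torch

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50, tiny=1e-9):
    """Entropy-regularized OT distance between batches of discrete
    distributions p, q of shape (batch, K) under a shared (K, K) cost matrix,
    computed with Sinkhorn-Knopp iterations (generic sketch)."""
    kernel = torch.exp(-cost / eps)                    # Gibbs kernel, (K, K)
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):
        u = p / (v @ kernel.T + tiny)                  # row scaling
        v = q / (u @ kernel + tiny)                    # column scaling
    plan = u.unsqueeze(-1) * kernel * v.unsqueeze(-2)  # transport plan, (batch, K, K)
    return (plan * cost).sum(dim=(-2, -1)).mean()

# Example with a simple 0-1 cost between classes (an assumption):
# cost = 1.0 - torch.eye(num_classes)
# loss = sinkhorn_distance(student_probs, teacher_probs, cost)
```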
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation [9.157410884444312]
Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as $\tau$ increases and on label matching as $\tau$ goes to 0 (a minimal temperature sketch follows this entry).
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with a small $\tau$, mitigates label noise.
arXiv Detail & Related papers (2021-05-19T04:40:53Z)
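To illustrate the temperature behavior described in the entry above, here is a minimal sketch of the standard temperature-scaled KL distillation loss; the demo temperatures and random logits are purely illustrative.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau):
    """Temperature-scaled KL distillation loss; the tau**2 factor keeps the
    gradient scale roughly comparable across temperatures."""
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

# With a large tau both distributions are heavily softened and the loss acts
# like logit matching; with a tiny tau the teacher's distribution collapses
# toward its argmax, so the loss acts like matching the teacher's hard label.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
for tau in (0.1, 1.0, 20.0):
    print(f"tau={tau}: loss={kd_kl_loss(student, teacher, tau).item():.4f}")
```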