Biased Teacher, Balanced Student
- URL: http://arxiv.org/abs/2506.18496v1
- Date: Mon, 23 Jun 2025 10:46:44 GMT
- Title: Biased Teacher, Balanced Student
- Authors: Seonghak Kim,
- Abstract summary: Long-Tailed Knowledge Distillation (LTKD) is a novel framework tailored for class-imbalanced scenarios. Experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) is a widely adopted model compression technique where a compact student model learns from the output of a larger, pre-trained teacher. While effective in balanced settings, conventional KD suffers significantly when applied to long-tailed data distributions, as the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. In this paper, we propose Long-Tailed Knowledge Distillation (LTKD), a novel framework tailored for class-imbalanced scenarios. We begin by reformulating the standard KD objective into two components: inter-group and intra-group Kullback-Leibler (KL) divergence, corresponding to the prediction distributions across and within class groups (head, medium, tail), respectively. This decomposition allows us to identify and quantify the sources of teacher bias. To address them, we introduce (1) a rebalanced inter-group loss that calibrates the teacher's group-level predictions and (2) a uniform intra-group loss that ensures equal contribution from all groups during distillation. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods, achieving significant gains in both overall accuracy and tail-class performance. Our results demonstrate that LTKD enables effective knowledge transfer even from biased teachers, making it a strong candidate for real-world deployment in resource-constrained and imbalanced settings.
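The inter-group/intra-group split described in the abstract follows the chain rule of the KL divergence: partition the classes into groups, compare the group-level marginals, then add the mass-weighted divergence of the within-group conditionals. A minimal NumPy sketch of that identity (the group boundaries and distributions below are illustrative, not the paper's actual setup or its rebalancing losses):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def decompose_kl(p, q, groups):
    """Split KL(p || q) into inter-group and intra-group terms.

    groups: list of index arrays partitioning the classes
    (e.g. head / medium / tail).
    """
    # Group-level marginals: total probability mass per group.
    P = np.array([p[g].sum() for g in groups])
    Q = np.array([q[g].sum() for g in groups])
    inter = kl(P, Q)  # divergence across groups
    # Mass-weighted divergence of the within-group conditionals.
    intra = sum(P[i] * kl(p[g] / P[i], q[g] / Q[i])
                for i, g in enumerate(groups))
    return inter, intra

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))   # stand-in "teacher" distribution
q = rng.dirichlet(np.ones(10))   # stand-in "student" distribution
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 10)]
inter, intra = decompose_kl(p, q, groups)
assert np.isclose(inter + intra, kl(p, q))  # chain rule: the two terms sum to the full KL
```

Because the two terms sum exactly to the standard KD objective, each one can be reweighted independently, which is what makes a decomposition like this a natural handle on group-level teacher bias.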
Related papers
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation [57.524909883706556]
On-policy distillation (OPD) has demonstrated strong empirical gains in improving student performance. This work introduces a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, ExOPD enables the student to surpass even the teacher's performance boundary.
arXiv Detail & Related papers (2026-02-12T16:14:29Z) - REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency [0.0]
We introduce REDistill, a principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence. Experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures.
arXiv Detail & Related papers (2026-02-04T15:50:53Z) - SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines [82.00660447875266]
Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. We rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). Our analysis shows that learning from Bayes class probabilities (BCPs) yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD.
arXiv Detail & Related papers (2026-01-04T11:09:49Z) - ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence [89.630486749083]
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model. The core challenge in KD lies in balancing two mode-concentration effects. We propose ABKD, a generic framework with $\alpha$-$\beta$-divergence.
arXiv Detail & Related papers (2025-05-07T16:48:49Z) - Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines Pearson and Spearman correlation coefficient-based KD losses to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it consistently achieves state-of-the-art performance on CIFAR-100 and ImageNet.
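As a rough sketch of correlation-based KD losses of the kind CMKD describes: a Pearson term matches the linear shape of the teacher's logits, while a Spearman term (Pearson on rank transforms) matches only their ordering. The equal weighting and the logit values below are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def pearson_loss(s, t):
    """1 - Pearson correlation between student and teacher logit vectors."""
    s = (s - s.mean()) / s.std()  # standardize (population std)
    t = (t - t.mean()) / t.std()
    return 1.0 - float(np.mean(s * t))

def spearman_loss(s, t):
    """1 - Spearman correlation: Pearson computed on the rank transforms."""
    rs = np.argsort(np.argsort(s)).astype(float)  # rank of each logit
    rt = np.argsort(np.argsort(t)).astype(float)
    return pearson_loss(rs, rt)

s = np.array([2.0, 0.5, -1.0, 3.0])  # hypothetical student logits
t = np.array([1.8, 0.4, -0.9, 2.7])  # hypothetical teacher logits
loss = 0.5 * pearson_loss(s, t) + 0.5 * spearman_loss(s, t)  # equal weights (assumed)
```

Note that a hard rank transform like the one above is not differentiable, so a training implementation would need a soft-ranking surrogate; this sketch only illustrates what the two correlation terms measure.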
arXiv Detail & Related papers (2024-10-09T05:42:47Z) - Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment [10.104085497265004]
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models. Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to significant performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-19T08:06:42Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.