Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification
- URL: http://arxiv.org/abs/2509.05319v1
- Date: Sat, 30 Aug 2025 09:18:46 GMT
- Title: Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification
- Authors: Zhengda Li
- Abstract summary: Adaptive Knowledge Distillation (AKD) is a framework for transferring knowledge from a large teacher network to a smaller student model. We make alpha a learnable parameter that is automatically optimized during training. Our approach achieves superior accuracy compared to fixed-weight KD baselines and yields more stable convergence.
- Score: 0.33842793760651557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label cross-entropy loss with the soft-label distillation loss. However, a static alpha is suboptimal because the optimal trade-off between hard and soft supervision can vary during training. In this work, we propose an Adaptive Knowledge Distillation (AKD) framework. First, we make alpha a learnable parameter that is automatically optimized during training. Then we introduce a formula that computes alpha dynamically from the gap between the student and the teacher, and further introduce a Context-Aware Module (CAM) using an MLP with attention to adaptively reweight class-wise teacher outputs. Experiments on CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student demonstrate that our approach achieves superior accuracy compared to fixed-weight KD baselines and yields more stable convergence.
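The abstract outlines three ingredients: a learnable alpha, a gap-driven formula for alpha, and a Context-Aware Module (MLP + attention) that reweights class-wise teacher outputs. The paper's exact gap formula and CAM architecture are not given in this summary, so the PyTorch sketch below is only illustrative: the KL-based gap measure, the sigmoid mapping from gap to alpha, and the CAM layout are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareModule(nn.Module):
    """Hypothetical CAM: an MLP plus single-head self-attention that produces
    per-class weights for the teacher's soft targets (layout is an assumption)."""

    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )
        self.attn = nn.MultiheadAttention(embed_dim=num_classes, num_heads=1, batch_first=True)

    def forward(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        h = self.mlp(teacher_logits).unsqueeze(1)   # (B, 1, C)
        h, _ = self.attn(h, h, h)                   # (B, 1, C)
        return torch.sigmoid(h.squeeze(1))          # per-class weights in (0, 1)


def akd_loss(student_logits, teacher_logits, targets, cam, T: float = 4.0):
    """Sketch of an adaptive KD loss: alpha is derived from the student-teacher
    gap (assumed here to be a batch-level KL divergence squashed by a sigmoid),
    and the teacher distribution is reweighted class-wise by the CAM."""
    ce = F.cross_entropy(student_logits, targets)

    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)

    # Student-teacher discrepancy -> dynamic alpha (assumption: a larger gap
    # shifts weight toward the teacher's soft targets).
    gap = F.kl_div(log_p_s, p_t, reduction="batchmean")
    alpha = torch.sigmoid(gap).detach()  # detached so alpha acts as a weight, not a gradient shortcut

    # Context-aware reweighting of the teacher's class-wise outputs.
    w = cam(teacher_logits.detach())
    p_t_weighted = w * p_t
    p_t_weighted = p_t_weighted / p_t_weighted.sum(dim=1, keepdim=True)

    kd = F.kl_div(log_p_s, p_t_weighted, reduction="batchmean") * (T * T)
    return (1.0 - alpha) * ce + alpha * kd
```

In the abstract's first variant, alpha would instead be registered as a learnable parameter (e.g., a `torch.nn.Parameter` passed through a sigmoid) and optimized jointly with the student; the sketch above shows only the gap-driven variant.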
Related papers
- Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation [13.970649308150895]
Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. We propose a feature KD framework for training the student's backbone using feature-based losses exclusively. We demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches.
arXiv Detail & Related papers (2025-11-18T23:50:31Z) - PLD: A Choice-Theoretic List-Wise Knowledge Distillation [14.801268227422517]
We recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher model transfers knowledge of its full ranking of classes; an illustrative sketch of such a list-wise loss appears after this list.
arXiv Detail & Related papers (2025-06-14T15:31:54Z) - ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence [89.630486749083]
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model. The core challenge in KD lies in balancing two mode-concentration effects. We propose ABKD, a generic framework with $α$-$β$-divergence.
arXiv Detail & Related papers (2025-05-07T16:48:49Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z) - Online Hyperparameter Optimization for Class-Incremental Learning [99.70569355681174]
Class-incremental learning (CIL) aims to train a classification model while the number of classes increases phase-by-phase.
An inherent challenge of CIL is the stability-plasticity tradeoff, i.e., CIL models should keep stable to retain old knowledge and keep plastic to absorb new knowledge.
We propose an online learning method that can adaptively optimize the tradeoff without knowing the setting a priori.
arXiv Detail & Related papers (2023-01-11T17:58:51Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Switchable Online Knowledge Distillation [68.2673580932132]
Online Knowledge Distillation (OKD) improves involved models by reciprocally exploiting the difference between teacher and student.
We propose Switchable Online Knowledge Distillation (SwitOKD) to address this.
arXiv Detail & Related papers (2022-09-12T03:03:40Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing [25.49637460661711]
Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle this problem via the use of an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
arXiv Detail & Related papers (2022-02-07T14:53:22Z) - Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
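As referenced in the PLD entry above, that summary describes interpreting teacher logits as "worth" scores and transferring the teacher's full class ranking. One standard way to realize such a list-wise objective is the negative Plackett-Luce (listMLE-style) log-likelihood of the teacher-induced ranking under the student's logits; PLD's actual weighting scheme is not reproduced here, so the unweighted form below is only an illustrative sketch.

```python
import torch


def plackett_luce_kd_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative list-wise distillation loss: negative Plackett-Luce
    log-likelihood of the teacher's class ranking under the student's logits
    (unweighted listMLE form; PLD's weighting is not reproduced here)."""
    # Rank classes by the teacher's "worth" scores, highest first.
    order = teacher_logits.argsort(dim=1, descending=True)              # (B, C)
    s = torch.gather(student_logits, 1, order)                          # student logits in teacher order
    # log-sum-exp over the suffix [k..C-1] is the Plackett-Luce normalizer at rank k.
    suffix_lse = torch.logcumsumexp(s.flip(dims=[1]), dim=1).flip(dims=[1])
    log_lik = (s - suffix_lse).sum(dim=1)                               # log P(teacher ranking | student)
    return -log_lik.mean()
```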