Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?
- URL: http://arxiv.org/abs/2301.12609v4
- Date: Wed, 25 Oct 2023 03:10:01 GMT
- Title: Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?
- Authors: Md Arafat Sultan
- Abstract summary: We re-examine the equivalence between the methods by comparing the predictive confidences of the models they train.
In most settings, KD and LS drive model confidence in completely opposite directions.
In KD, the student inherits not only its knowledge but also its confidence from the teacher, reinforcing the classical knowledge transfer view.
- Score: 6.323424953013902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Originally proposed as a method for knowledge transfer from one model to
another, some recent studies have suggested that knowledge distillation (KD) is
in fact a form of regularization. Perhaps the strongest argument of all for
this new perspective comes from its apparent similarities with label smoothing
(LS). Here we re-examine this stated equivalence between the two methods by
comparing the predictive confidences of the models they train. Experiments on
four text classification tasks involving models of different sizes show that:
(a) In most settings, KD and LS drive model confidence in completely opposite
directions, and (b) In KD, the student inherits not only its knowledge but also
its confidence from the teacher, reinforcing the classical knowledge transfer
view.
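The contrast the abstract draws can be made concrete with a minimal sketch in plain Python (the logits and hyperparameters below are hypothetical, chosen only for illustration): label smoothing spreads probability mass uniformly over the non-target classes, pulling the target distribution toward uniform, whereas KD's temperature-softened teacher outputs assign unequal mass to wrong classes, carrying the teacher's class-similarity structure and confidence.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ls_targets(one_hot, eps=0.1):
    # Label smoothing: mix the one-hot label with a uniform distribution.
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

# Hypothetical 3-class example: class 0 is correct, class 1 is similar to it.
teacher_logits = [4.0, 2.5, -1.0]
one_hot = [1.0, 0.0, 0.0]

kd_targets = softmax(teacher_logits, T=2.0)  # teacher's softened distribution
smoothed = ls_targets(one_hot, eps=0.1)

# LS gives every wrong class the same mass (eps / k); the KD targets do not.
print(smoothed)    # uniform mass on the two wrong classes
print(kd_targets)  # class 1 gets more mass than class 2
```

A student trained on `kd_targets` is pushed toward the teacher's relative confidences, while one trained on `smoothed` is pushed toward a fixed, uniform flattening — which is the asymmetry the paper's confidence-based experiments probe.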
Related papers
- BicKD: Bilateral Contrastive Knowledge Distillation [7.791534714823052]
Knowledge distillation (KD) is a machine learning framework that transfers knowledge from a teacher model to a student model. Vanilla KD has been the dominant approach in logit-based distillation. We propose a simple yet effective methodology, bilateral contrastive knowledge distillation (BicKD).
arXiv Detail & Related papers (2026-02-01T14:54:34Z) - Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods [31.111748100296527]
This study investigates the effect of knowledge distillation on the transferability of debiasing capabilities. To the best of our knowledge, this is the first study on the effect of KD on debiasing and its internal mechanism at scale.
arXiv Detail & Related papers (2025-10-30T00:34:16Z) - Self-Evolution Knowledge Distillation for LLM-based Machine Translation [36.01859033056453]
We propose a distillation strategy called Self-Evolution KD.
The core of this approach involves dynamically integrating teacher distribution and one-hot distribution of ground truth into the student distribution as prior knowledge.
Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions in the WMT22 test sets.
arXiv Detail & Related papers (2024-12-19T12:24:15Z) - Gradual Learning: Optimizing Fine-Tuning with Partially Mastered Knowledge in Large Language Models [51.20499954955646]
Large language models (LLMs) acquire vast amounts of knowledge from extensive text corpora during the pretraining phase.
In later stages such as fine-tuning and inference, the model may encounter knowledge not covered in the initial training.
We propose a two-stage fine-tuning strategy to improve the model's overall test accuracy and knowledge retention.
arXiv Detail & Related papers (2024-10-08T08:35:16Z) - Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - On the Impact of Knowledge Distillation for Model Interpretability [22.18694053092722]
Knowledge distillation (KD) enhances the interpretability as well as the accuracy of models.
We attribute the improvement in interpretability to the class-similarity information transferred from the teacher to student models.
Our findings suggest that models distilled from large teacher models can be used more reliably in various fields.
arXiv Detail & Related papers (2023-05-25T05:35:11Z) - AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z) - Adaptively Integrated Knowledge Distillation and Prediction Uncertainty for Continual Learning [71.43841235954453]
Current deep learning models often suffer from catastrophic forgetting of old knowledge when continually learning new knowledge.
Existing strategies to alleviate this issue often fix the trade-off between retaining old knowledge (stability) and learning new knowledge (plasticity).
arXiv Detail & Related papers (2023-01-18T05:36:06Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Learning Interpretation with Explainable Knowledge Distillation [28.00216413365036]
Knowledge Distillation (KD) has been considered as a key solution in model compression and acceleration in recent years.
We propose a novel explainable knowledge distillation model, called XDistillation, through which both the performance and the explanations' information are transferred from the teacher model to the student model.
Our experiments show that models trained by XDistillation outperform those trained by conventional KD methods in terms of predictive accuracy and also faithfulness to the teacher models.
arXiv Detail & Related papers (2021-11-12T21:18:06Z) - Revisiting Knowledge Distillation: An Inheritance and Exploration Framework [153.73692961660964]
Knowledge Distillation (KD) is a popular technique to transfer knowledge from a teacher model to a student model.
We propose a novel inheritance and exploration knowledge distillation framework (IE-KD)
Our IE-KD framework is generic and can be easily combined with existing distillation or mutual learning methods for training deep neural networks.
arXiv Detail & Related papers (2021-07-01T02:20:56Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Results show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.