Extending Label Smoothing Regularization with Self-Knowledge
Distillation
- URL: http://arxiv.org/abs/2009.05226v1
- Date: Fri, 11 Sep 2020 04:23:34 GMT
- Title: Extending Label Smoothing Regularization with Self-Knowledge
Distillation
- Authors: Ji-Yue Wang, Pei Zhang, Wen-feng Pang, Jie Li
- Abstract summary: We propose an algorithm, LsrKD, that boosts training by extending the LSR method to the KD regime and applying a softer temperature.
To further improve the performance of LsrKD, we develop a self-distillation method named Memory-replay Knowledge Distillation (MrKD).
Our experiments show that LsrKD can improve LSR performance consistently at no cost, especially on several deep neural networks where LSR is ineffectual.
- Score: 11.009345791558601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the strong correlation between Label Smoothing
Regularization (LSR) and Knowledge Distillation (KD), we propose an algorithm,
LsrKD, that boosts training by extending the LSR method to the KD regime and
applying a softer temperature. We then improve LsrKD with a Teacher
Correction (TC) method, which manually assigns a constant, larger proportion to
the correct class in the uniform-distribution teacher. To further improve the
performance of LsrKD, we develop a self-distillation method named Memory-replay
Knowledge Distillation (MrKD) that provides a knowledgeable teacher to replace
the uniform-distribution one in LsrKD. MrKD penalizes the KD loss between the
current model's output distributions and those of its earlier copies along the
training trajectory. By keeping the model from drifting too far from its
historical output distributions, MrKD stabilizes learning and finds a more
robust minimum. Our experiments show that LsrKD improves on LSR consistently
at no extra cost, especially on several deep neural networks where LSR is
ineffectual. MrKD can also significantly improve single-model training. The
experimental results confirm that TC helps LsrKD and MrKD boost training,
especially on the networks where they otherwise fail. Overall, LsrKD, MrKD,
and their TC variants are comparable to or outperform the LSR method,
suggesting the broad applicability of these KD methods.
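As a rough, illustrative sketch of the two losses described in the abstract (not the authors' reference code), the PyTorch snippet below implements (i) an LsrKD-style KD term against a uniform-distribution teacher with Teacher Correction, and (ii) an MrKD-style KD term against frozen earlier copies of the model kept from the training trajectory. The hyper-parameter names (alpha, T, tc_mass) and the exact weighting between the cross-entropy and KD terms are assumptions made for illustration only.

```python
# Hypothetical sketch of the LsrKD (with Teacher Correction) and MrKD losses
# described in the abstract; hyper-parameters and weighting are assumptions.
import copy
import torch
import torch.nn.functional as F


def lsrkd_tc_loss(logits, targets, num_classes, alpha=0.1, T=4.0, tc_mass=0.5):
    """Cross-entropy plus a KD term against a (corrected) uniform teacher.

    Teacher Correction (TC): the teacher places `tc_mass` on the correct class
    and spreads the remaining probability uniformly over the other classes.
    With tc_mass = 1 / num_classes this reduces to the plain uniform teacher
    of LsrKD.
    """
    ce = F.cross_entropy(logits, targets)
    teacher = torch.full_like(logits, (1.0 - tc_mass) / (num_classes - 1))
    teacher.scatter_(1, targets.unsqueeze(1), tc_mass)
    kd = F.kl_div(F.log_softmax(logits / T, dim=1), teacher,
                  reduction="batchmean") * T * T
    return (1.0 - alpha) * ce + alpha * kd


def mrkd_loss(logits, targets, snapshot_models, inputs, alpha=0.1, T=4.0):
    """Cross-entropy plus a KD term against frozen earlier copies of the model
    taken along the training trajectory (memory replay)."""
    ce = F.cross_entropy(logits, targets)
    kd = 0.0
    with torch.no_grad():
        teacher_probs = [F.softmax(m(inputs) / T, dim=1) for m in snapshot_models]
    for p in teacher_probs:
        kd = kd + F.kl_div(F.log_softmax(logits / T, dim=1), p,
                           reduction="batchmean") * T * T
    kd = kd / max(len(teacher_probs), 1)
    return (1.0 - alpha) * ce + alpha * kd


def take_snapshot(model, snapshots, max_snapshots=2):
    """Append a frozen copy of the current model to the replay buffer."""
    snapshots.append(copy.deepcopy(model).eval())
    if len(snapshots) > max_snapshots:
        snapshots.pop(0)
```

In a training loop, take_snapshot would typically be called every few epochs so that mrkd_loss always distills from recent, frozen copies of the model rather than from a fixed external teacher.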
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines a KD loss based on the Pearson and Spearman correlation coefficients to achieve more efficient and robust distillation from a stronger teacher model (an illustrative sketch of such a loss appears after this list).
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2024-10-09T05:42:47Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of an implicit reward and a reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning [14.379472108242235]
We investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy.
KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks.
Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training.
arXiv Detail & Related papers (2023-08-18T13:22:59Z)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
arXiv Detail & Related papers (2023-06-23T17:56:26Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale behind its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
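The Correlation Matching entry above describes a KD loss built from Pearson and Spearman correlation coefficients between student and teacher outputs. The sketch below is a minimal, illustrative PyTorch reading of that idea, not the CMKD authors' implementation; in particular, the rank-based Spearman term shown here is not differentiable, so a soft-ranking surrogate would be needed to actually train with it.

```python
# Illustrative (not the CMKD authors') sketch of a correlation-matching KD loss
# combining Pearson and Spearman correlations over student and teacher logits.
import torch


def pearson_loss(student_logits, teacher_logits):
    # 1 - Pearson correlation, computed per sample over the class dimension.
    s = student_logits - student_logits.mean(dim=1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)
    corr = (s * t).sum(dim=1) / (s.norm(dim=1) * t.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()


def spearman_loss(student_logits, teacher_logits):
    # Spearman correlation is the Pearson correlation of the ranks; argsort is
    # not differentiable, so this only illustrates the target quantity.
    s_rank = student_logits.argsort(dim=1).argsort(dim=1).float()
    t_rank = teacher_logits.argsort(dim=1).argsort(dim=1).float()
    return pearson_loss(s_rank, t_rank)


def correlation_matching_loss(student_logits, teacher_logits, lam=0.5):
    # Weighted combination of the two correlation-based terms (lam is assumed).
    return (lam * pearson_loss(student_logits, teacher_logits)
            + (1.0 - lam) * spearman_loss(student_logits, teacher_logits))
```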