Improve Knowledge Distillation via Label Revision and Data Selection
- URL: http://arxiv.org/abs/2404.03693v1
- Date: Wed, 3 Apr 2024 02:41:16 GMT
- Title: Improve Knowledge Distillation via Label Revision and Data Selection
- Authors: Weichao Lan, Yiu-ming Cheung, Qing Xu, Buhua Liu, Zhikai Hu, Mengke Li, Zhenghua Chen
- Abstract summary: This paper tackles unreliable teacher supervision from two aspects: Label Revision, which rectifies the teacher's inaccurate predictions using the ground truth, and Data Selection, which chooses suitable training samples to be supervised by the teacher.
Experimental results demonstrate the effectiveness of the proposed method and show that it can be combined with other distillation approaches.
- Score: 37.74822443555646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has become a widely used technique in the field of model compression, which aims to transfer knowledge from a large teacher model to a lightweight student model for efficient network development. In addition to the supervision of ground truth, the vanilla KD method regards the predictions of the teacher as soft labels to supervise the training of the student model. Based on vanilla KD, various approaches have been developed to further improve the performance of the student model. However, few of these previous methods have considered the reliability of the supervision from teacher models. Supervision from erroneous predictions may mislead the training of the student model. This paper therefore proposes to tackle this problem from two aspects: Label Revision to rectify the incorrect supervision and Data Selection to select appropriate samples for distillation to reduce the impact of erroneous supervision. In the former, we propose to rectify the teacher's inaccurate predictions using the ground truth. In the latter, we introduce a data selection technique to choose suitable training samples to be supervised by the teacher, thereby reducing the impact of incorrect predictions to some extent. Experiment results demonstrate the effectiveness of our proposed method, and show that our method can be combined with other distillation approaches, improving their performance.
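The two components described in the abstract can be illustrated with a minimal sketch. The function names (revise_labels, select_samples), the linear blending rule, and the confidence-based selection criterion below are assumptions made for illustration, not the paper's exact formulation; the sketch only shows how rectified soft labels and sample selection could plug into a standard KD loss.

```python
# Minimal sketch of KD with label revision and data selection (illustrative only;
# the blending rule and selection criterion are assumptions, not the paper's
# exact formulation).
import torch
import torch.nn.functional as F


def revise_labels(teacher_logits, targets, alpha=0.5, temperature=4.0):
    """Blend the teacher's soft labels with the one-hot ground truth on samples
    the teacher misclassifies, so the distillation target is never plainly wrong."""
    soft = F.softmax(teacher_logits / temperature, dim=1)
    one_hot = F.one_hot(targets, num_classes=soft.size(1)).float()
    wrong = (soft.argmax(dim=1) != targets).unsqueeze(1)  # (N, 1) bool mask
    return torch.where(wrong, alpha * soft + (1.0 - alpha) * one_hot, soft)


def select_samples(teacher_logits, targets, conf_threshold=0.3):
    """Keep samples where the teacher assigns reasonable probability to the true
    class; the rest are supervised by the ground truth alone."""
    probs = F.softmax(teacher_logits, dim=1)
    true_class_prob = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return true_class_prob >= conf_threshold


def kd_loss(student_logits, teacher_logits, targets, temperature=4.0, beta=0.7):
    """Cross-entropy on all samples plus a distillation term on selected samples
    with revised soft labels."""
    ce = F.cross_entropy(student_logits, targets)
    mask = select_samples(teacher_logits, targets)
    if mask.any():
        revised = revise_labels(teacher_logits[mask], targets[mask],
                                temperature=temperature)
        log_p = F.log_softmax(student_logits[mask] / temperature, dim=1)
        kld = F.kl_div(log_p, revised, reduction="batchmean") * temperature ** 2
    else:
        kld = torch.zeros((), device=student_logits.device)
    return (1.0 - beta) * ce + beta * kld
```

In a training loop this would be called as loss = kd_loss(student(x), teacher(x).detach(), y); the weights alpha and beta, the confidence threshold, and the temperature are placeholders rather than values reported in the paper.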
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (see the sketch after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge [17.382306203152943]
The Dynamic Guidance Adversarial Distillation (DGAD) framework tackles the challenge of differential sample importance.
DGAD employs Misclassification-Aware Partitioning (MAP) to dynamically tailor the distillation focus.
Error-corrective Label Swapping (ELS) corrects misclassifications of the teacher on both clean and adversarially perturbed inputs.
arXiv Detail & Related papers (2024-09-03T05:52:37Z)
- A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models [104.64899255277443]
Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve its performance.
This study examines three key factors to optimize DWT, distinct from those used in the vision domain or traditional knowledge distillation.
arXiv Detail & Related papers (2023-05-26T13:24:49Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue -- popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Dual Correction Strategy for Ranking Distillation in Top-N Recommender System [22.37864671297929]
This paper presents a Dual Correction strategy for Knowledge Distillation (DCD).
DCD transfers the ranking information from the teacher model to the student model in a more efficient manner.
Our experiments show that the proposed method outperforms the state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-08T07:00:45Z)
- Knowledge Distillation as Semiparametric Inference [44.572422527672416]
A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
This two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data.
We cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate.
arXiv Detail & Related papers (2021-04-20T03:00:45Z)
- DE-RRD: A Knowledge Distillation Framework for Recommender System [16.62204445256007]
We propose a knowledge distillation framework for recommender systems, called DE-RRD.
It enables the student model to learn from the latent knowledge encoded in the teacher model as well as from the teacher's predictions.
Our experiments show that DE-RRD outperforms state-of-the-art competitors and achieves performance comparable to or even better than that of the teacher model, with faster inference time.
arXiv Detail & Related papers (2020-12-08T11:09:22Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
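As a side note on the Speculative Knowledge Distillation entry above, the student-proposes/teacher-replaces loop it describes can be sketched roughly as follows. The greedy student draft, the rank threshold, the function name interleaved_sample, and the model interface (a callable returning per-position logits) are assumptions of this sketch, not the published algorithm.

```python
# Hedged sketch of speculative-style interleaved sampling for distillation data:
# the student drafts the next token, and the teacher overrides it when the
# student's choice ranks poorly under the teacher's distribution. The rank
# threshold and greedy drafting are illustrative assumptions.
import torch


@torch.no_grad()
def interleaved_sample(student, teacher, prompt_ids, max_new_tokens=32, rank_threshold=20):
    """Generate one training sequence token by token.

    `student` and `teacher` are callables mapping a (1, seq_len) LongTensor of
    token ids to (1, seq_len, vocab) logits, e.g. wrapped as
    `lambda ids: model(ids).logits` (an assumption of this sketch).
    """
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        student_logits = student(ids)[:, -1, :]   # (1, vocab) logits at last position
        proposal = student_logits.argmax(dim=-1)  # student drafts greedily
        teacher_logits = teacher(ids)[:, -1, :]
        # Rank of the proposed token under the teacher's distribution (0 = best).
        rank = (teacher_logits > teacher_logits.gather(-1, proposal.unsqueeze(-1))).sum(dim=-1)
        if rank.item() >= rank_threshold:
            # Poorly ranked proposal: the teacher substitutes its own sample.
            probs = torch.softmax(teacher_logits, dim=-1)
            proposal = torch.multinomial(probs, num_samples=1).squeeze(-1)
        ids = torch.cat([ids, proposal.unsqueeze(-1)], dim=-1)
    return ids
```

Sequences produced this way would then serve as on-the-fly distillation data, which is the role the SKD summary attributes to them.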
This list is automatically generated from the titles and abstracts of the papers on this site.