ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing
- URL: http://arxiv.org/abs/2202.03250v1
- Date: Mon, 7 Feb 2022 14:53:22 GMT
- Title: ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing
- Authors: Durga Sivasubramanian, Pradeep Shenoy, Prathosh AP and Ganesh Ramakrishnan
- Abstract summary: Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle this problem via the use of an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
- Score: 25.49637460661711
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge distillation (KD) is a technique in which the outputs of a
pretrained model, often known as the teacher model, are used to train a student
model in a supervised setting. The teacher model's outputs, being a richer
distribution over labels, should improve the student model's performance
compared to training with the usual hard labels. However, the label
distribution imposed by the logits of the teacher network may not always be
informative and may lead to poor student performance. We tackle this problem
via an adaptive loss mixing scheme during KD. Specifically, our method learns
an instance-specific convex combination of the teacher-matching and
label-supervision objectives, using meta-learning on a validation metric that
signals to the student `how much' of KD is to be used. Through a range of
experiments on controlled synthetic data and real-world datasets, we
demonstrate performance gains obtained using our approach in the standard KD
setting as well as in multi-teacher and self-distillation settings.
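
The following is a minimal, self-contained PyTorch sketch of the adaptive loss
mixing idea described above. It is an illustration under assumptions only: the
mixing network (here a tiny MLP fed with the two per-instance losses), the
temperature, the one-step meta-update on a held-out batch, and all model sizes
and learning rates are choices made for this example, not details of the
authors' implementation.

# Illustrative sketch only; the mixer architecture, its inputs, and all
# hyperparameters are assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def per_instance_kd_loss(student_logits, teacher_logits, T=4.0):
    # Temperature-scaled KL divergence to the teacher, one value per instance.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1) * (T * T)


class MixingNet(nn.Module):
    # Predicts an instance-specific weight w in [0, 1], i.e. `how much' KD to use.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, kd_loss, ce_loss):
        feats = torch.stack([kd_loss, ce_loss], dim=-1)
        return torch.sigmoid(self.net(feats)).squeeze(-1)


def mixed_loss(student_logits, teacher_logits, labels, mixer):
    # Instance-specific convex combination of teacher matching and label supervision.
    kd = per_instance_kd_loss(student_logits, teacher_logits)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    w = mixer(kd.detach(), ce.detach())
    return (w * kd + (1.0 - w) * ce).mean()


# Toy setup: a linear student distilling from a fixed random "teacher".
torch.manual_seed(0)
student, teacher, mixer = nn.Linear(20, 5), nn.Linear(20, 5), MixingNet()
opt_student = torch.optim.SGD(student.parameters(), lr=0.1)
opt_mixer = torch.optim.SGD(mixer.parameters(), lr=0.01)
x, y = torch.randn(64, 20), torch.randint(0, 5, (64,))
x_val, y_val = torch.randn(64, 20), torch.randint(0, 5, (64,))

for step in range(200):
    # Meta step: take a virtual one-step SGD update of the student under the
    # mixed loss (kept differentiable), then tune the mixer so that the
    # updated student does well on held-out data under plain cross-entropy.
    inner = mixed_loss(student(x), teacher(x).detach(), y, mixer)
    gW, gb = torch.autograd.grad(inner, (student.weight, student.bias),
                                 create_graph=True)
    val_logits = x_val @ (student.weight - 0.1 * gW).t() + (student.bias - 0.1 * gb)
    val_loss = F.cross_entropy(val_logits, y_val)
    opt_mixer.zero_grad()
    val_loss.backward()
    opt_mixer.step()

    # Regular step: train the student with the freshly tuned mixed objective.
    loss = mixed_loss(student(x), teacher(x).detach(), y, mixer)
    opt_student.zero_grad()
    loss.backward()
    opt_student.step()
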
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (a toy sketch of this step appears after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning [14.379472108242235]
We investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy.
KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks.
Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training.
arXiv Detail & Related papers (2023-08-18T13:22:59Z) - On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
arXiv Detail & Related papers (2023-06-23T17:56:26Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue -- popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models [10.941519846908697]
We introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely the Oracle Teacher.
Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance.
Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution.
arXiv Detail & Related papers (2021-11-05T14:14:05Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Learning from a Lightweight Teacher for Efficient Knowledge Distillation [14.865673786025525]
This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset whose adjustable number of classes equals that of the target dataset.
The teacher then generates soft targets with which an enhanced KD loss guides student learning; this loss combines the KD loss with an adversarial loss that makes the student's output indistinguishable from the teacher's.
arXiv Detail & Related papers (2020-05-19T01:54:15Z)
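
The Speculative Knowledge Distillation entry above describes an interleaved
sampling step in which the student proposes tokens and the teacher overrides
the ones it ranks poorly. The toy sketch below illustrates only that step: the
stand-in "models" (random linear next-token predictors), the top-k acceptance
rule, and the sequence length are assumptions made for the example, not
details taken from the SKD paper.

# Toy illustration of interleaved student/teacher sampling; the models and the
# acceptance rule below are assumptions, not the SKD authors' implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, TOP_K, SEQ_LEN = 50, 16, 5, 12

student_head = torch.randn(DIM, VOCAB)
teacher_head = torch.randn(DIM, VOCAB)
embed = torch.randn(VOCAB, DIM)


def next_token_logits(head, token_id):
    # Stand-in for a real autoregressive model: logits from the last token only.
    return embed[token_id] @ head


tokens = [0]  # start token
for _ in range(SEQ_LEN):
    s_logits = next_token_logits(student_head, tokens[-1])
    t_logits = next_token_logits(teacher_head, tokens[-1])

    # The student proposes a token...
    proposal = torch.multinomial(F.softmax(s_logits, dim=-1), 1).item()

    # ...and it is kept only if it ranks well under the teacher's distribution;
    # otherwise the teacher substitutes its own sample.
    if proposal in torch.topk(t_logits, TOP_K).indices.tolist():
        tokens.append(proposal)
    else:
        tokens.append(torch.multinomial(F.softmax(t_logits, dim=-1), 1).item())

print("interleaved training sequence:", tokens)
# Such on-the-fly sequences would then serve as distillation data, i.e. the
# student is trained to match the teacher's distribution on them.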