MixKD: Towards Efficient Distillation of Large-scale Language Models
- URL: http://arxiv.org/abs/2011.00593v2
- Date: Wed, 17 Mar 2021 06:38:05 GMT
- Title: MixKD: Towards Efficient Distillation of Large-scale Language Models
- Authors: Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen,
Changyou Chen, Lawrence Carin
- Abstract summary: We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
- Score: 129.73786264834894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language models have recently demonstrated impressive empirical
performance. Nevertheless, the improved results are attained at the price of
bigger models, more power consumption, and slower inference, which hinder their
applicability to low-resource (both memory and computation) platforms.
Knowledge distillation (KD) has been demonstrated as an effective framework for
compressing such big models. However, large-scale neural network systems are
prone to memorize training instances, and thus tend to make inconsistent
predictions when the data distribution is altered slightly. Moreover, the
student model has few opportunities to request useful information from the
teacher model when there is limited task-specific data available. To address
these issues, we propose MixKD, a data-agnostic distillation framework that
leverages mixup, a simple yet efficient data augmentation approach, to endow
the resulting model with stronger generalization ability. Concretely, in
addition to the original training examples, the student model is encouraged to
mimic the teacher's behavior on the linear interpolation of example pairs as
well. We prove from a theoretical perspective that under reasonable conditions
MixKD gives rise to a smaller gap between the generalization error and the
empirical error. To verify its effectiveness, we conduct experiments on the
GLUE benchmark, where MixKD consistently leads to significant gains over the
standard KD training, and outperforms several competitive baselines.
Experiments under a limited-data setting and ablation studies further
demonstrate the advantages of the proposed approach.
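To make the objective concrete, below is a minimal PyTorch-style sketch of mixup-based distillation as described in the abstract: the student is trained on the labelled examples and additionally matches the teacher's soft predictions on both the original inputs and linear interpolations of example pairs. The stand-in MLP teacher and student, the embedding-level mixing, the temperature, and the loss weights are illustrative assumptions for this sketch, not the paper's exact architecture or objective.

```python
# Hedged sketch of mixup-based knowledge distillation in the spirit of MixKD;
# this is NOT the authors' implementation. Teacher and student are stand-in
# MLP classifiers over fixed-size "sentence embeddings" so the example runs
# without any pretrained models.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM, NUM_CLASSES, TEMP = 64, 2, 2.0  # temperature is an illustrative choice

teacher = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
student = nn.Sequential(nn.Linear(EMB_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_CLASSES))
teacher.eval()  # the teacher is frozen during distillation


def mixkd_batch_loss(x, y, mix_alpha=0.4, kd_weight=1.0):
    """Cross-entropy on real examples + KD on real AND mixup-interpolated examples."""
    # Mixup: convex combination of each example with a permuted partner.
    lam = torch.distributions.Beta(mix_alpha, mix_alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]

    with torch.no_grad():  # teacher provides soft targets only
        t_real = F.softmax(teacher(x) / TEMP, dim=-1)
        t_mix = F.softmax(teacher(x_mix) / TEMP, dim=-1)

    s_logits = student(x)
    s_mix_logits = student(x_mix)

    # Supervised loss on the original labelled examples.
    ce = F.cross_entropy(s_logits, y)

    # The student mimics the teacher on original and interpolated inputs.
    kd_real = F.kl_div(F.log_softmax(s_logits / TEMP, dim=-1), t_real, reduction="batchmean")
    kd_mix = F.kl_div(F.log_softmax(s_mix_logits / TEMP, dim=-1), t_mix, reduction="batchmean")

    return ce + kd_weight * (kd_real + kd_mix)


# Toy usage: one optimisation step on random embeddings and labels.
x = torch.randn(16, EMB_DIM)
y = torch.randint(0, NUM_CLASSES, (16,))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss = mixkd_batch_loss(x, y)
loss.backward()
opt.step()
print(f"MixKD-style loss: {loss.item():.4f}")
```

Since discrete text cannot be interpolated directly, a real implementation would typically mix at the word- or sentence-embedding level of the encoder; the mixed-label supervised term used in standard mixup is omitted here for simplicity.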
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment [10.104085497265004]
We propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency of peak predictions between the teacher and student models.
Our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.
arXiv Detail & Related papers (2024-09-19T08:06:42Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties are unveiled and leveraged, and (2) an adaptive off-policy approach designed to enhance the efficiency of utilizing student-generated outputs (see the skew-KL sketch after this list).
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Reducing Capacity Gap in Knowledge Distillation with Review Mechanism for Crowd Counting [16.65360204274379]
This paper introduces a novel review mechanism appended to KD models, motivated by how human beings review material during study.
The effectiveness of ReviewKD is demonstrated by a set of experiments over six benchmark datasets.
We also show that the suggested review mechanism can be used as a plug-and-play module to further boost the performance of heavy crowd counting models.
arXiv Detail & Related papers (2022-06-11T09:11:42Z)
- DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings [36.37939188680754]
We propose an enhanced knowledge distillation framework termed Distill-Contrast (DisCo)
DisCo transfers the capability of a large sentence embedding model to a small student model on large unlabelled data.
We also propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model finetuning.
arXiv Detail & Related papers (2021-12-10T16:11:23Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
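The DistiLLM entry above mentions a skew Kullback-Leibler divergence loss. As a rough illustration, here is a minimal sketch assuming the common α-skew form KL(p || αp + (1 - α)q), which mixes a little of the teacher distribution into the student distribution before computing the KL term. The function name, the α value, and the reduction are assumptions; the paper's exact formulation (including reverse and adaptive variants) may differ.

```python
# Illustrative alpha-skew KL divergence, assuming the common definition
# skew_KL(p || q) = KL(p || alpha * p + (1 - alpha) * q). DistiLLM's exact
# formulation (and its reverse / adaptive variants) may differ.
import torch
import torch.nn.functional as F


def skew_kl(teacher_logits, student_logits, alpha=0.1):
    p = F.softmax(teacher_logits, dim=-1)   # teacher distribution
    q = F.softmax(student_logits, dim=-1)   # student distribution
    mix = alpha * p + (1.0 - alpha) * q     # skewed reference distribution
    # Mixing a little of p into q keeps the divergence finite and its gradients stable.
    return torch.sum(p * (torch.log(p + 1e-12) - torch.log(mix + 1e-12)), dim=-1).mean()


# Toy check: identical logits give (near-)zero divergence.
logits = torch.randn(4, 10)
print(skew_kl(logits, logits).item())                          # ~0.0
print(skew_kl(torch.randn(4, 10), torch.randn(4, 10)).item())  # > 0
```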
This list is automatically generated from the titles and abstracts of the papers on this site.