Comparative Knowledge Distillation
- URL: http://arxiv.org/abs/2311.02253v1
- Date: Fri, 3 Nov 2023 21:55:33 GMT
- Title: Comparative Knowledge Distillation
- Authors: Alex Wilf, Alex Tianyi Xu, Paul Pu Liang, Alexander Obolenskiy, Daniel
Fried, Louis-Philippe Morency
- Abstract summary: Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
- Score: 102.35425896967791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of large scale pretrained models, Knowledge Distillation (KD)
plays an important role in transferring the wisdom of computationally heavy
teacher models to lightweight, efficient student models while preserving
performance. Traditional KD paradigms, however, assume readily available access
to teacher models for frequent inference -- a notion increasingly at odds with
the realities of costly, often proprietary, large scale models. Addressing this
gap, our paper considers how to minimize the dependency on teacher model
inferences in KD in a setting we term Few Teacher Inference Knowledge
Distillation (FTI KD). We observe that prevalent KD techniques and state-of-the-art
data augmentation strategies fall short in this constrained setting.
Drawing inspiration from educational principles that emphasize learning through
comparison, we propose Comparative Knowledge Distillation (CKD), which
encourages student models to understand the nuanced differences in a teacher
model's interpretations of samples. Critically, CKD provides additional
learning signals to the student without making additional teacher calls. We
also extend the principle of CKD to groups of samples, enabling even more
efficient learning from limited teacher calls. Empirical evaluation across
varied experimental settings indicates that CKD consistently outperforms
state-of-the-art data augmentation and KD techniques.
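To make the comparative objective concrete, the following is a minimal PyTorch sketch of a pairwise comparative distillation loss in the spirit of CKD, assuming the teacher logits for a batch have already been cached from a small number of teacher calls. The pairing scheme (rolling the batch) and the MSE criterion are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of a pairwise comparative distillation loss (assumption:
# teacher logits are cached; pairing by rolling the batch and using MSE are
# illustrative choices, not necessarily the paper's exact recipe).
import torch
import torch.nn.functional as F


def comparative_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor) -> torch.Tensor:
    """Match sample-to-sample differences of student and teacher outputs.

    Both tensors have shape (batch, num_classes); the teacher logits come
    from previously cached calls, so no extra teacher inference is needed.
    """
    # Pair each sample with its neighbor by rolling the batch dimension.
    student_diff = student_logits - torch.roll(student_logits, shifts=1, dims=0)
    teacher_diff = teacher_logits - torch.roll(teacher_logits, shifts=1, dims=0)
    # Penalize disagreement between the two difference vectors.
    return F.mse_loss(student_diff, teacher_diff)


if __name__ == "__main__":
    student_out = torch.randn(16, 10, requires_grad=True)   # stand-in student logits
    cached_teacher_out = torch.randn(16, 10)                 # cached teacher logits
    loss = comparative_kd_loss(student_out, cached_teacher_out)
    loss.backward()
    print(float(loss))
```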
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
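As a rough illustration of the interleaved sampling idea described above, the toy step below accepts a student-proposed token if it ranks inside the teacher's top-k and otherwise lets the teacher resample from its own distribution; the top-k acceptance rule is an assumed simplification, not necessarily SKD's actual criterion.

```python
# Toy sketch of one interleaved-sampling step: the student proposes a token,
# and the teacher replaces it when it ranks poorly under the teacher's own
# distribution. The top-k acceptance rule is an assumed simplification.
import torch


def skd_step(student_token: int, teacher_logits: torch.Tensor, k: int = 20) -> int:
    """teacher_logits: 1-D tensor over the vocabulary for the current position."""
    top_k_ids = torch.topk(teacher_logits, k).indices
    if student_token in top_k_ids:
        return student_token                                  # keep the student's proposal
    probs = torch.softmax(teacher_logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))       # teacher resamples
```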
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
Correlation Matching Knowledge Distillation (CMKD) method combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
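The sketch below shows what a correlation-matching distillation loss of this kind might look like: a Pearson term on raw logits plus a Spearman-style term on class ranks. The equal weighting and the hard (non-differentiable) ranks are simplifying assumptions; the paper's exact loss may differ.

```python
# Sketch of a correlation-matching KD loss: Pearson correlation on logits plus
# a Spearman-style correlation on class ranks (hard ranks and the weighting
# below are simplifying assumptions).
import torch


def pearson_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation across classes, averaged over the batch."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    corr = (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + 1e-8)
    return (1.0 - corr).mean()


def cmkd_loss(student_logits, teacher_logits, alpha: float = 0.5):
    pearson = pearson_loss(student_logits, teacher_logits)
    # Spearman-style term: Pearson correlation computed on class ranks.
    s_ranks = student_logits.argsort(dim=-1).argsort(dim=-1).float()
    t_ranks = teacher_logits.argsort(dim=-1).argsort(dim=-1).float()
    spearman = pearson_loss(s_ranks, t_ranks)
    return alpha * pearson + (1.0 - alpha) * spearman
```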
arXiv Detail & Related papers (2024-10-09T05:42:47Z) - Invariant Causal Knowledge Distillation in Neural Networks [6.24302896438145]
In this paper, we introduce Invariant Consistency Distillation (ICD), a novel methodology designed to enhance knowledge distillation.
ICD ensures that the student model's representations are both discriminative and invariant with respect to the teacher's outputs.
Our results on CIFAR-100 and ImageNet ILSVRC-2012 show that ICD outperforms traditional KD techniques and surpasses state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T14:53:35Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
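A rough sketch of the input-gradient-alignment idea is given below: alongside a standard KD term, the student's input gradient is pulled toward the teacher's. The use of cross-entropy for the input gradients and MSE for the alignment penalty are assumptions for illustration.

```python
# Rough sketch of KD with input gradient alignment: a standard KL-based KD
# term plus a penalty on the distance between teacher and student input
# gradients (the specific loss choices here are illustrative assumptions).
import torch
import torch.nn.functional as F


def kdiga_loss(student, teacher, x, y, lam: float = 1.0):
    x = x.clone().requires_grad_(True)

    # Teacher input gradient; detached, since the teacher is never trained.
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0].detach()

    # Student input gradient; keep the graph so the alignment term is trainable.
    s_logits = student(x)
    s_grad = torch.autograd.grad(F.cross_entropy(s_logits, y), x, create_graph=True)[0]

    kd = F.kl_div(F.log_softmax(s_logits, dim=-1),
                  F.softmax(t_logits.detach(), dim=-1), reduction="batchmean")
    alignment = F.mse_loss(s_grad, t_grad)
    return kd + lam * alignment
```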
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
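As an illustration of mixup-style distillation, the sketch below linearly interpolates pairs of inputs and trains the student to match the teacher on the mixed examples. Evaluating the teacher directly on the mixed input (rather than interpolating its predictions) is an implementation choice assumed here, not necessarily MixKD's exact procedure.

```python
# Sketch of mixup-style distillation: mix pairs of inputs and distill the
# teacher's prediction on the mixed example into the student. Running the
# teacher on the mixed input is an assumed implementation choice.
import torch
import torch.nn.functional as F


def mixkd_batch_loss(student, teacher, x, alpha: float = 0.4):
    # Sample a mixing coefficient and a random pairing of the batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]

    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x_mix), dim=-1)
    student_log_probs = F.log_softmax(student(x_mix), dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```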
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher).
In this study, we provide an extensive examination of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.