Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching
- URL: http://arxiv.org/abs/2410.06561v1
- Date: Wed, 9 Oct 2024 05:42:47 GMT
- Title: Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching
- Authors: Wenqi Niu, Yingchao Wang, Guohui Cai, Hanpo Hou,
- Abstract summary: Correlation Matching Knowledge Distillation (CMKD) method combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIRAR-100 and ImageNet.
- Score: 0.09999629695552192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) has emerged as a pivotal technique for neural network compression and performance enhancement. Most KD methods aim to transfer dark knowledge from a cumbersome teacher model to a lightweight student model based on Kullback-Leibler (KL) divergence loss. However, the student performance improvements achieved through KD exhibit diminishing marginal returns, where a stronger teacher model does not necessarily lead to a proportionally stronger student model. To address this issue, we empirically find that the KL-based KD method may implicitly change the inter-class relationships learned by the student model, resulting in a more complex and ambiguous decision boundary, which in turn reduces the model's accuracy and generalization ability. Therefore, this study argues that the student model should learn not only the probability values from the teacher's output but also the relative ranking of classes, and proposes a novel Correlation Matching Knowledge Distillation (CMKD) method that combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model. Moreover, considering that samples vary in difficulty, CMKD dynamically adjusts the weights of the Pearson-based loss and Spearman-based loss. CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIRAR-100 and ImageNet, and adapts well to various teacher architectures, sizes, and other KD methods.
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve the knowledge distillation (KD)
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in Knowledge distillation (KD)
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher)
In this study, we provide an extensive study on nine different KD methods which covers a broad spectrum of approaches to capture and transfer knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.