Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in
Knowledge Distillation
- URL: http://arxiv.org/abs/2105.08919v1
- Date: Wed, 19 May 2021 04:40:53 GMT
- Title: Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in
Knowledge Distillation
- Authors: Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, Se-Young Yun
- Abstract summary: Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0.
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates label noise.
- Score: 9.157410884444312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD), transferring knowledge from a cumbersome teacher
model to a lightweight student model, has been investigated to design efficient
neural architectures. Generally, the objective function of KD is the
Kullback-Leibler (KL) divergence loss between the softened probability
distributions of the teacher model and the student model with the temperature
scaling hyperparameter tau. Despite its widespread use, few studies have
discussed the influence of such softening on generalization. Here, we
theoretically show that the KL divergence loss focuses on logit matching as
tau increases and on label matching as tau goes to 0, and empirically
show that logit matching is positively correlated with performance
improvement in general. From this observation, we consider an intuitive KD loss
function, the mean squared error (MSE) between the logit vectors, so that the
student model can directly learn the logit of the teacher model. The MSE loss
outperforms the KL divergence loss, which we explain by the difference in the
penultimate-layer representations induced by the two losses. Furthermore, we show
that sequential distillation can improve performance and that KD, particularly
when using the KL divergence loss with small tau, mitigates label noise.
The code to reproduce the experiments is publicly available online at
https://github.com/jhoon-oh/kd_data/.
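To make the comparison concrete, here is a minimal PyTorch sketch of the two objectives discussed above: the temperature-scaled KL divergence between softened distributions and the direct MSE between logit vectors. The tau**2 gradient rescaling and the toy batch are conventional illustrations, not the authors' exact training code (that is in the repository linked above).

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    """Temperature-scaled KL divergence between softened class distributions.

    The tau**2 factor keeps gradient magnitudes comparable across
    temperatures (a common convention, assumed here).
    """
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

def kd_mse_loss(student_logits, teacher_logits):
    """Direct logit matching: mean squared error between logit vectors."""
    return F.mse_loss(student_logits, teacher_logits)

# Toy usage with random logits standing in for model outputs.
student_logits = torch.randn(8, 100)  # batch of 8 samples, 100 classes
teacher_logits = torch.randn(8, 100)
print(kd_kl_loss(student_logits, teacher_logits, tau=20.0))  # large tau: close to logit matching
print(kd_mse_loss(student_logits, teacher_logits))
```

Intuitively, as tau grows the softened distributions flatten and the KL gradient approaches the difference of mean-centered logits (logit matching), while as tau goes to 0 the teacher's softened distribution collapses toward its argmax label (label matching), which is the trade-off the abstract describes.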
Related papers
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines Pearson and Spearman correlation coefficient-based KD losses to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
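The abstract does not spell out how the two correlation terms are combined; the sketch below is one plausible reading, using per-sample Pearson correlation over the class dimension and a sigmoid-based soft rank as a differentiable stand-in for Spearman's rank correlation. The weighting `alpha` and the soft-rank surrogate are assumptions, not the CMKD authors' formulation.

```python
import torch

def pearson_corr(x, y, eps=1e-8):
    # Per-sample Pearson correlation along the class dimension.
    x = x - x.mean(dim=-1, keepdim=True)
    y = y - y.mean(dim=-1, keepdim=True)
    return (x * y).sum(-1) / (x.norm(dim=-1) * y.norm(dim=-1) + eps)

def soft_rank(x, temp=1.0):
    # Differentiable surrogate for ranks via pairwise sigmoid comparisons.
    diff = x.unsqueeze(-1) - x.unsqueeze(-2)   # (B, C, C)
    return torch.sigmoid(diff / temp).sum(-1)  # (B, C)

def cmkd_style_loss(student_logits, teacher_logits, alpha=0.5):
    # Hypothetical combination: "1 - correlation" terms averaged over the batch.
    pearson_term = (1 - pearson_corr(student_logits, teacher_logits)).mean()
    spearman_term = (1 - pearson_corr(soft_rank(student_logits),
                                      soft_rank(teacher_logits))).mean()
    return alpha * pearson_term + (1 - alpha) * spearman_term
```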
arXiv Detail & Related papers (2024-10-09T05:42:47Z)
- Kendall's $τ$ Coefficient for Logits Distillation [33.77389987117822]
We propose a ranking loss based on Kendall's $τ$ coefficient, called Rank-Kendall Knowledge Distillation (RKKD).
RKKD balances the attention to smaller-valued channels by constraining the order of channel values in student logits.
Our experiments show that our RKKD can enhance the performance of various knowledge distillation baselines and offer broad improvements across multiple teacher-student architecture combinations.
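Kendall's tau itself is non-differentiable, so a loss in this spirit needs a smooth surrogate; the sketch below uses tanh-relaxed pairwise comparisons on the student side against the teacher's hard pairwise orderings. This is a generic relaxation for illustration, not necessarily the exact RKKD loss.

```python
import torch

def soft_kendall_tau(student_logits, teacher_logits, temp=1.0):
    # Differentiable surrogate for Kendall's tau between two logit vectors:
    # soft pairwise orderings of the student vs. hard orderings of the teacher.
    s_diff = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # (B, C, C)
    t_diff = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)
    concordance = torch.tanh(s_diff / temp) * torch.sign(t_diff)
    num_pairs = student_logits.size(-1) * (student_logits.size(-1) - 1)
    return concordance.sum(dim=(-1, -2)) / num_pairs  # roughly in [-1, 1], per sample

def rank_loss(student_logits, teacher_logits):
    # Hypothetical ranking penalty: encourage concordant channel orderings.
    return (1 - soft_kendall_tau(student_logits, teacher_logits)).mean()
```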
arXiv Detail & Related papers (2024-09-26T13:21:02Z)
- Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
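For orientation, a generic entropy-regularized Sinkhorn distance between the softened teacher and student distributions can be sketched as below; the ground cost `cost`, the regularization `eps`, and the fixed iteration count are placeholders rather than the SinKD authors' choices.

```python
import torch
import torch.nn.functional as F

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized OT distance between per-sample distributions.

    p, q: (B, C) probability vectors; cost: (C, C) ground cost between classes.
    A generic Sinkhorn sketch, not the SinKD implementation.
    """
    K = torch.exp(-cost / eps)  # (C, C) Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):
        v = q / (u @ K + 1e-9)
        u = p / (v @ K.T + 1e-9)
    transport = u.unsqueeze(-1) * K * v.unsqueeze(-2)  # (B, C, C) transport plan
    return (transport * cost).sum(dim=(-1, -2))        # per-sample transport cost

def sinkd_style_loss(student_logits, teacher_logits, cost, tau=2.0):
    p = F.softmax(student_logits / tau, dim=-1)
    q = F.softmax(teacher_logits / tau, dim=-1)
    return sinkhorn_distance(p, q, cost).mean()

# Toy usage: 0/1 ground cost between distinct classes.
C = 10
cost = 1.0 - torch.eye(C)
s, t = torch.randn(4, C), torch.randn(4, C)
print(sinkd_style_loss(s, t, cost))
```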
arXiv Detail & Related papers (2024-02-27T01:13:58Z)
- Cosine Similarity Knowledge Distillation for Individual Class Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
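The summary does not describe how the cosine similarity enters the loss; purely as a hypothetical illustration, the sketch below scales a per-sample temperature by the cosine similarity between student and teacher logits before applying a standard softened-KL distillation term. Every detail here (the base temperature, the scaling rule) is a guess, not the CSWT formulation.

```python
import torch
import torch.nn.functional as F

def cswt_style_loss(student_logits, teacher_logits, base_tau=4.0):
    # Hypothetical per-sample temperature driven by student-teacher agreement.
    cos = F.cosine_similarity(student_logits, teacher_logits, dim=-1)  # (B,)
    tau = (base_tau * (1.0 + cos)).clamp(min=0.5).unsqueeze(-1)        # (B, 1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(-1)         # per-sample KL
    return (kl * tau.squeeze(-1) ** 2).mean()
```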
arXiv Detail & Related papers (2023-11-24T06:34:47Z)
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
- Decoupled Kullback-Leibler Divergence Loss [90.54331083430597]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce class-wise global information into KL/DKL to reduce the bias from individual samples.
The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Causal KL: Evaluating Causal Discovery [0.0]
The two most commonly used criteria for assessing causal model discovery with artificial data are edit distance and Kullback-Leibler divergence.
We argue that they are both insufficiently discriminating in judging the relative merits of false models.
We propose an augmented KL divergence, which takes into account causal relationships which distinguish between observationally equivalent models.
arXiv Detail & Related papers (2021-11-11T02:46:53Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.