Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation
for Balanced Knowledge Transfer
- URL: http://arxiv.org/abs/2311.13621v1
- Date: Wed, 22 Nov 2023 08:34:33 GMT
- Title: Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation
for Balanced Knowledge Transfer
- Authors: Chi-Ping Su, Ching-Hsun Tseng, Shin-Jye Lee
- Abstract summary: Distillation (KD) transfers knowledge from a larger "teacher" model to a student.
ERKD uses entropy in the teacher's predictions to reweight the KD loss on a sample-wise basis.
Our code is available at https://github.com/cpsu00/ER-KD.
- Score: 1.2606200500489302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model
to a compact "student" model, guiding the student with the "dark knowledge"
$\unicode{x2014}$ the implicit insights present in the teacher's soft
predictions. Although existing KDs have shown the potential of transferring
knowledge, the gap between the two parties still exists. With a series of
investigations, we argue the gap is the result of the student's overconfidence
in prediction, signaling an imbalanced focus on pronounced features while
overlooking the subtle yet crucial dark knowledge. To overcome this, we
introduce the Entropy-Reweighted Knowledge Distillation (ER-KD), a novel
approach that leverages the entropy in the teacher's predictions to reweight
the KD loss on a sample-wise basis. ER-KD precisely refocuses the student on
challenging instances rich in the teacher's nuanced insights while reducing the
emphasis on simpler cases, enabling a more balanced knowledge transfer.
Consequently, ER-KD not only demonstrates compatibility with various
state-of-the-art KD methods but also further enhances their performance at
negligible cost. This approach offers a streamlined and effective strategy to
refine the knowledge transfer process in KD, setting a new paradigm in the
meticulous handling of dark knowledge. Our code is available at
https://github.com/cpsu00/ER-KD.
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
arXiv Detail & Related papers (2024-08-16T22:11:01Z) - Dynamic Temperature Knowledge Distillation [9.6046915661065]
Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD)
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously.
arXiv Detail & Related papers (2024-04-19T08:40:52Z) - Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve the knowledge distillation (KD)
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Gradient-Guided Knowledge Distillation for Object Detectors [3.236217153362305]
We propose a novel approach for knowledge distillation in object detection, named Gradient-guided Knowledge Distillation (GKD)
Our GKD uses gradient information to identify and assign more weights to features that significantly impact the detection loss, allowing the student to learn the most relevant features from the teacher.
Experiments on the KITTI and COCO-Traffic datasets demonstrate our method's efficacy in knowledge distillation for object detection.
arXiv Detail & Related papers (2023-03-07T21:09:09Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A)
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.