Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation
  for Balanced Knowledge Transfer
- URL: http://arxiv.org/abs/2311.13621v1
- Date: Wed, 22 Nov 2023 08:34:33 GMT
- Title: Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation
  for Balanced Knowledge Transfer
- Authors: Chi-Ping Su, Ching-Hsun Tseng, Shin-Jye Lee
- Abstract summary: Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model to a compact "student" model.
ER-KD uses the entropy of the teacher's predictions to reweight the KD loss on a sample-wise basis.
Our code is available at https://github.com/cpsu00/ER-KD.
- Score: 1.2606200500489302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model
to a compact "student" model, guiding the student with the "dark knowledge",
that is, the implicit insights present in the teacher's soft
predictions. Although existing KDs have shown the potential of transferring
knowledge, the gap between the two parties still exists. With a series of
investigations, we argue the gap is the result of the student's overconfidence
in prediction, signaling an imbalanced focus on pronounced features while
overlooking the subtle yet crucial dark knowledge. To overcome this, we
introduce the Entropy-Reweighted Knowledge Distillation (ER-KD), a novel
approach that leverages the entropy in the teacher's predictions to reweight
the KD loss on a sample-wise basis. ER-KD precisely refocuses the student on
challenging instances rich in the teacher's nuanced insights while reducing the
emphasis on simpler cases, enabling a more balanced knowledge transfer.
Consequently, ER-KD not only demonstrates compatibility with various
state-of-the-art KD methods but also further enhances their performance at
negligible cost. This approach offers a streamlined and effective strategy to
refine the knowledge transfer process in KD, setting a new paradigm in the
meticulous handling of dark knowledge. Our code is available at
https://github.com/cpsu00/ER-KD.
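
The sample-wise reweighting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the weight is the raw Shannon entropy of the teacher's temperature-softened prediction and that the base KD loss is a plain KL divergence (without the usual temperature-squared factor). The function names and the default temperature are hypothetical; the repository linked above is the authoritative reference.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_reweighted_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Batch mean of per-sample KD losses, each scaled by teacher entropy.

    Assumption: the weight is the unnormalized entropy of the teacher's
    softened prediction; the paper may normalize or scale it differently.
    """
    weighted = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_logits, temperature)
        p_student = softmax(s_logits, temperature)
        weight = entropy(p_teacher)  # larger when the teacher is uncertain
        weighted.append(weight * kl_divergence(p_teacher, p_student))
    return sum(weighted) / len(weighted)


if __name__ == "__main__":
    # One confident and one ambiguous teacher prediction (made-up logits).
    teacher = [[10.0, 0.0, 0.0], [1.0, 0.9, 0.8]]
    student = [[5.0, 1.0, 0.0], [2.0, 0.5, 0.1]]
    print(entropy_reweighted_kd_loss(teacher, student))
```

Under these assumptions, a sample on which the teacher is nearly certain contributes little to the loss, while a sample with a flat teacher distribution is weighted more heavily, which matches the abstract's description of refocusing the student on challenging instances.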
 
      
Related papers
- Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation [0.0]
 We propose AdvDistill, a reward-guided dataset distillation framework.
We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers.
These varying and normally distributed rewards serve as weights when training student models.
 arXiv  Detail & Related papers  (2025-06-25T20:07:47Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
 A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations, so that the student distills from task-relevant representations only.
 Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
 arXiv  Detail & Related papers  (2025-04-19T14:08:56Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
 We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
 arXiv  Detail & Related papers  (2024-10-15T06:51:25Z)
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
 We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
 Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods.
 arXiv  Detail & Related papers  (2024-09-03T07:42:59Z)
- Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
 Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
 arXiv  Detail & Related papers  (2024-08-16T22:11:01Z)
- Dynamic Temperature Knowledge Distillation [9.6046915661065]
 Temperature plays a pivotal role in moderating label softness in knowledge distillation (KD).
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously.
 arXiv  Detail & Related papers  (2024-04-19T08:40:52Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
 We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
 arXiv  Detail & Related papers  (2024-02-19T07:01:10Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
 Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
 arXiv  Detail & Related papers  (2023-11-23T11:34:48Z)
- Comparative Knowledge Distillation [102.35425896967791]
 Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
 arXiv  Detail & Related papers  (2023-11-03T21:55:33Z)
- Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation [59.31690622031927]
 We show that the knowledge comes from the top-1 predictions of teachers.
We propose a novel method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD).
 arXiv  Detail & Related papers  (2023-05-14T08:23:03Z)
- Gradient-Guided Knowledge Distillation for Object Detectors [3.236217153362305]
 We propose a novel approach for knowledge distillation in object detection, named Gradient-guided Knowledge Distillation (GKD).
Our GKD uses gradient information to identify and assign more weights to features that significantly impact the detection loss, allowing the student to learn the most relevant features from the teacher.
Experiments on the KITTI and COCO-Traffic datasets demonstrate our method's efficacy in knowledge distillation for object detection.
 arXiv  Detail & Related papers  (2023-03-07T21:09:09Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
 Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
 arXiv  Detail & Related papers  (2022-09-20T16:36:28Z)
- Knowledge Condensation Distillation [38.446333274732126]
 Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD).
Our approach is easy to build on top of the off-the-shelf KD methods, with no extra training parameters and negligible overhead.
 arXiv  Detail & Related papers  (2022-07-12T09:17:34Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
 This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, but significantly degrades the performance of student models that learn by imitating it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
 arXiv  Detail & Related papers  (2021-05-16T08:41:30Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
 We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
 arXiv  Detail & Related papers  (2021-05-10T08:15:26Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
 Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
 arXiv  Detail & Related papers  (2020-12-05T00:32:04Z)
- Residual Knowledge Distillation [96.18815134719975]
 This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
 arXiv  Detail & Related papers  (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     