Decoupled Knowledge Distillation
- URL: http://arxiv.org/abs/2203.08679v1
- Date: Wed, 16 Mar 2022 15:07:47 GMT
- Title: Decoupled Knowledge Distillation
- Authors: Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, Jiajun Liang
- Abstract summary: We reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD).
TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works.
We present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly.
- Score: 7.049113958508325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art distillation methods are mainly based on distilling deep
features from intermediate layers, while the significance of logit distillation
is greatly overlooked. To provide a novel viewpoint to study logit
distillation, we reformulate the classical KD loss into two parts, i.e., target
class knowledge distillation (TCKD) and non-target class knowledge distillation
(NCKD). We empirically investigate and prove the effects of the two parts: TCKD
transfers knowledge concerning the "difficulty" of training samples, while NCKD
is the prominent reason why logit distillation works. More importantly, we
reveal that the classical KD loss is a coupled formulation, which (1)
suppresses the effectiveness of NCKD and (2) limits the flexibility to balance
these two parts. To address these issues, we present Decoupled Knowledge
Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently
and flexibly. Compared with complex feature-based methods, our DKD achieves
comparable or even better results and has better training efficiency on
CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object
detection tasks. This paper proves the great potential of logit distillation,
and we hope it will be helpful for future research. The code is available at
https://github.com/megvii-research/mdistiller.
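To make the TCKD/NCKD decomposition concrete, below is a minimal PyTorch-style sketch of a decoupled loss of this form. It is an illustration inferred from the abstract, not the authors' reference implementation (that lives in the mdistiller repository linked above); the hyperparameters alpha and beta, the temperature T, and the masking trick used to isolate the non-target classes are assumptions made for the sketch.

```python
# Minimal sketch of a decoupled KD loss (TCKD + NCKD), inferred from the
# abstract; not the reference code from mdistiller. alpha, beta, and T are
# assumed hyperparameters.
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0, T=4.0):
    """alpha * TCKD + beta * NCKD (sketch)."""
    num_classes = student_logits.size(1)
    gt_mask = F.one_hot(target, num_classes=num_classes).bool()

    s_prob = F.softmax(student_logits / T, dim=1)
    t_prob = F.softmax(teacher_logits / T, dim=1)

    # TCKD: compare binary (target vs. all non-target) probabilities.
    s_bin = torch.stack([(s_prob * gt_mask).sum(1), (s_prob * ~gt_mask).sum(1)], dim=1)
    t_bin = torch.stack([(t_prob * gt_mask).sum(1), (t_prob * ~gt_mask).sum(1)], dim=1)
    tckd = F.kl_div(torch.log(s_bin), t_bin, reduction="batchmean") * (T ** 2)

    # NCKD: compare distributions over non-target classes only. Pushing the
    # target logit to a large negative value before the softmax effectively
    # removes it from the distribution.
    s_nt = F.log_softmax(student_logits / T - 1000.0 * gt_mask.float(), dim=1)
    t_nt = F.softmax(teacher_logits / T - 1000.0 * gt_mask.float(), dim=1)
    nckd = F.kl_div(s_nt, t_nt, reduction="batchmean") * (T ** 2)

    # In the paper's analysis, classical KD ties the NCKD term to the
    # teacher's confidence on the target class; DKD instead weights the two
    # parts with independent hyperparameters alpha and beta.
    return alpha * tckd + beta * nckd
```

In this sketch, classical KD would correspond to fixing alpha to 1 and tying the NCKD weight to the teacher's target-class confidence rather than treating beta as a free hyperparameter.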
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose Grouped Knowledge Distillation (GKD), which retains Primary-KD and Binary-KD but omits Secondary-KD in the final KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD), which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation scheme that effectively distills late interaction (i.e., ColBERT) to a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z)
- Spot-adaptive Knowledge Distillation [39.23627955442595]
We propose a new distillation strategy, termed spot-adaptive KD (SAKD).
SAKD adaptively determines the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period.
Experiments with 10 state-of-the-art distillers are conducted to demonstrate the effectiveness of SAKD.
arXiv Detail & Related papers (2022-05-05T02:21:32Z)
- Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years.
arXiv Detail & Related papers (2022-04-12T17:14:34Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale behind its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: the KD loss can benefit from extended training iterations while the cross-entropy loss cannot.
We show this disparity arises because of data augmentation: the KD loss can tap into the extra information from different input views brought by DA (a generic sketch of this setup follows the list below).
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
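As context for the last entry above, here is a generic sketch of the classical temperature-scaled KD loss evaluated on an augmented input view. This is the standard Hinton-style formulation rather than anything specific to that paper, and the model handles, the temperature T, and the augmentation pipeline are all assumptions; it only illustrates where the extra information provided by DA enters the KD loss.

```python
# Generic illustration (not code from the cited paper): Hinton-style KD loss
# computed on an augmented view, so the student sees the teacher's soft
# labels for every view produced by data augmentation.
import torch
import torch.nn.functional as F

def kd_loss_on_view(student, teacher, x_aug, T=4.0):
    """Temperature-scaled KD loss for one augmented batch x_aug (assumed models)."""
    with torch.no_grad():
        t_prob = F.softmax(teacher(x_aug) / T, dim=1)      # teacher soft labels
    s_logprob = F.log_softmax(student(x_aug) / T, dim=1)   # student log-probs
    # KL(teacher || student), scaled by T^2 to keep gradients comparable
    # to the cross-entropy term.
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * (T ** 2)
```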