Rethinking Knowledge Distillation via Cross-Entropy
- URL: http://arxiv.org/abs/2208.10139v1
- Date: Mon, 22 Aug 2022 08:32:08 GMT
- Title: Rethinking Knowledge Distillation via Cross-Entropy
- Authors: Zhendong Yang, Zhe Li, Yuan Gong, Tianke Zhang, Shanshan Lao, Chun
Yuan, Yu Li
- Abstract summary: We try to decompose the KD loss to explore its relation with the CE loss.
We find it can be regarded as a combination of the CE loss and an extra loss that has an identical form to the CE loss.
In training without teachers, MobileNet, ResNet-18 and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48% ImageNet Top-1 accuracy, respectively, 0.83%, 0.86%, and 0.30% higher than their baselines.
- Score: 23.46801498161629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) has developed extensively and boosted various
tasks. The classical KD method adds the KD loss to the original cross-entropy
(CE) loss. We try to decompose the KD loss to explore its relation with the CE
loss. Surprisingly, we find it can be regarded as a combination of the CE loss
and an extra loss that has an identical form to the CE loss. However, we
notice that the extra loss forces the student's relative probabilities to learn the
teacher's absolute probabilities. Moreover, the two sets of probabilities sum to
different values, making the loss hard to optimize. To address this issue, we revise the
formulation and propose a distributed loss. In addition, we utilize teachers'
target output as the soft target, proposing the soft loss. Combining the soft
loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we
smooth students' target output to treat it as the soft target for training
without teachers and propose a teacher-free new KD loss (tf-NKD). Our method
achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example,
with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18
from 69.90% to 71.96%. In training without teachers, MobileNet, ResNet-18 and
SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%,
0.86%, and 0.30% higher than the baseline, respectively. The code is available
at https://github.com/yzd-v/cls_KD.
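The abstract describes NKD as a soft loss (the teacher's target output used as a soft target, in CE form) plus a distributed loss over the re-normalized non-target probabilities, added to the usual CE loss on the labels. The following is a minimal PyTorch sketch of that reading, not the paper's official implementation (see the linked repository); the function name nkd_loss and the hyper-parameters gamma and temp are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nkd_loss(logit_s, logit_t, target, gamma=1.5, temp=1.0):
    """Sketch of the NKD idea from the abstract: a soft loss in CE form driven
    by the teacher's target probability, plus a distributed loss over the
    non-target probabilities after re-normalising both so each sums to one."""
    num_classes = logit_s.shape[1]
    mask = F.one_hot(target, num_classes).bool()

    s_target = F.softmax(logit_s, dim=1)[mask]  # student prob. of the ground-truth class
    t_target = F.softmax(logit_t, dim=1)[mask]  # teacher prob. of the ground-truth class

    # Soft loss: the teacher's target output acts as the soft target.
    soft_loss = -(t_target * torch.log(s_target + 1e-7)).mean()

    # Distributed loss: compare only the non-target probabilities, re-normalised
    # so the student's and teacher's non-target masses both sum to one.
    s_non = F.softmax(logit_s / temp, dim=1).masked_fill(mask, 0.0)
    t_non = F.softmax(logit_t / temp, dim=1).masked_fill(mask, 0.0)
    s_non = s_non / s_non.sum(dim=1, keepdim=True)
    t_non = t_non / t_non.sum(dim=1, keepdim=True)
    dist_loss = -(t_non * torch.log(s_non + 1e-7)).sum(dim=1).mean()

    return soft_loss + gamma * (temp ** 2) * dist_loss

# Usage: add the sketch to the ordinary CE loss on the ground-truth labels.
logit_t = torch.randn(8, 1000)                      # e.g. a ResNet-34 teacher
logit_s = torch.randn(8, 1000, requires_grad=True)  # e.g. a ResNet-18 student
target = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(logit_s, target) + nkd_loss(logit_s, logit_t, target)
loss.backward()
```

For the teacher-free variant (tf-NKD), the abstract replaces the teacher's output with a smoothed version of the student's own target output; the same skeleton would apply with logit_t replaced by a detached, smoothed copy of the student's prediction.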
Related papers
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines Pearson- and Spearman-correlation-based KD losses to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it consistently achieves state-of-the-art performance on CIFAR-100 and ImageNet; a minimal sketch of the Pearson-correlation component appears after this list.
arXiv Detail & Related papers (2024-10-09T05:42:47Z)
- Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication.
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Decoupled Kullback-Leibler Divergence Loss [90.54331083430597]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce class-wise global information into KL/DKL to reduce the bias from individual samples.
The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose Grouped Knowledge Distillation (GKD), which retains Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels [23.58665464454112]
Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student.
Universal Self-Knowledge Distillation (USKD) generates customized soft labels for both target and non-target classes without a teacher.
arXiv Detail & Related papers (2023-03-23T02:59:36Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation [9.157410884444312]
Knowledge distillation (KD) has been investigated to design efficient neural architectures.
We show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0.
We show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates the label noise.
arXiv Detail & Related papers (2021-05-19T04:40:53Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale behind its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
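As noted in the CMKD entry above, a correlation-based KD loss is compact enough to sketch. The snippet below covers only the Pearson-correlation part, computed as one minus the cosine similarity of mean-centred logits per sample; the Spearman part would additionally require a differentiable ranking and is omitted. The function name and its use as a drop-in KD term are assumptions for illustration, not CMKD's published formulation.

```python
import torch

def pearson_kd_loss(logit_s, logit_t):
    """One minus the per-sample Pearson correlation between student and
    teacher logits (a sketch of the Pearson half of a CMKD-style loss)."""
    s = logit_s - logit_s.mean(dim=1, keepdim=True)   # centre each row
    t = logit_t - logit_t.mean(dim=1, keepdim=True)
    s = s / (s.norm(dim=1, keepdim=True) + 1e-8)      # unit-normalise
    t = t / (t.norm(dim=1, keepdim=True) + 1e-8)
    return (1.0 - (s * t).sum(dim=1)).mean()          # 1 - correlation, averaged
```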