From Knowledge Distillation to Self-Knowledge Distillation: A Unified
Approach with Normalized Loss and Customized Soft Labels
- URL: http://arxiv.org/abs/2303.13005v2
- Date: Mon, 17 Jul 2023 12:22:21 GMT
- Title: From Knowledge Distillation to Self-Knowledge Distillation: A Unified
Approach with Normalized Loss and Customized Soft Labels
- Authors: Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li
- Abstract summary: Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student.
Universal Self-Knowledge Distillation (USKD) generates customized soft labels for both target and non-target classes without a teacher.
- Score: 23.58665464454112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) uses the teacher's prediction logits as soft
labels to guide the student, while self-KD does not require a real teacher to provide the soft labels. This work unifies the formulations of the two tasks by
decomposing and reorganizing the generic KD loss into a Normalized KD (NKD)
loss and customized soft labels for both target class (image's category) and
non-target classes named Universal Self-Knowledge Distillation (USKD). We
decompose the KD loss and find that its non-target part forces the student's non-target logits to match the teacher's, yet the two sets of non-target logits sum to different values, which prevents them from ever being identical. NKD normalizes the non-target logits so that their sums are equal; it can be applied to both KD and self-KD to make better use of the soft labels in the distillation loss. USKD generates
customized soft labels for both target and non-target classes without a
teacher. It smooths the student's target logit to obtain the soft target label and uses ranks derived from the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art
performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1
accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For
self-KD without teachers, USKD is the first self-KD method that can be
effectively applied to both CNN and ViT models with negligible additional time
and memory cost, resulting in new state-of-the-art results, such as 1.17% and
0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our
codes are available at https://github.com/yzd-v/cls_KD.
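As a concrete illustration of the two ideas above, the snippet below sketches (a) an NKD-style loss in which the non-target probabilities of the student and teacher are each re-normalized to sum to one before being matched, and (b) USKD-style teacher-free soft labels that smooth the student's own target probability and spread the remaining mass over the non-target classes by Zipf's law, ordered by a score derived from an intermediate feature. This is a minimal PyTorch sketch under assumed defaults: the function names, temperature, loss weight gamma, the smoothing rule, and the choice of ranking signal are illustrative rather than the paper's exact settings; the authors' implementation is at https://github.com/yzd-v/cls_KD.

```python
# Minimal sketch of NKD and USKD as described in the abstract.
# Hyperparameters, the smoothing rule, and the ranking signal are assumptions.
import torch
import torch.nn.functional as F


def nkd_loss(student_logits, teacher_logits, target, temperature=1.0, gamma=1.5):
    """NKD-style loss: a target term plus a non-target term computed on
    re-normalized non-target probabilities so that both sides sum to one."""
    num_classes = student_logits.size(1)
    mask = F.one_hot(target, num_classes).bool()

    s_prob = F.softmax(student_logits / temperature, dim=1)
    t_prob = F.softmax(teacher_logits / temperature, dim=1)

    # Target part: the teacher's target probability weights the student's log-prob.
    target_loss = -(t_prob[mask] * torch.log(s_prob[mask] + 1e-7)).mean()

    # Non-target part: zero out the target class and re-normalize each row,
    # so the student's and teacher's non-target distributions become comparable.
    s_non = s_prob.masked_fill(mask, 0.0)
    t_non = t_prob.masked_fill(mask, 0.0)
    s_non = s_non / (s_non.sum(dim=1, keepdim=True) + 1e-7)
    t_non = t_non / (t_non.sum(dim=1, keepdim=True) + 1e-7)
    non_target_loss = -(t_non * torch.log(s_non + 1e-7)).sum(dim=1).mean()

    return target_loss + gamma * (temperature ** 2) * non_target_loss


def uskd_soft_labels(student_logits, feature_scores, target, smooth=0.5, zipf_power=1.0):
    """USKD-style teacher-free soft labels: smooth the student's own target
    probability and distribute the remaining mass over the non-target classes
    with a Zipf distribution, ordered by a feature-derived score."""
    num_classes = student_logits.size(1)
    mask = F.one_hot(target, num_classes).bool()

    # Soft target label: an (assumed) convex smoothing of the student's
    # target probability toward the one-hot value 1.
    s_prob = F.softmax(student_logits, dim=1).detach()
    soft_target = smooth + (1.0 - smooth) * s_prob[mask]  # shape (N,)

    # Rank the non-target classes by the feature-derived scores
    # (assumed here to be logits computed from an intermediate feature).
    scores = feature_scores.masked_fill(mask, float("-inf"))
    ranks = scores.argsort(dim=1, descending=True).argsort(dim=1) + 1
    zipf = ranks.float().pow(-zipf_power).masked_fill(mask, 0.0)
    zipf = zipf / zipf.sum(dim=1, keepdim=True)

    # Non-target labels share the probability mass left over by the target.
    soft_labels = zipf * (1.0 - soft_target).unsqueeze(1)
    soft_labels[mask] = soft_target
    return soft_labels
```

In the self-KD setting, the student would minimize a cross-entropy between its predictions and the output of uskd_soft_labels; in the teacher-based setting, nkd_loss is applied directly to the teacher's logits.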
Related papers
- CLIP-KD: An Empirical Study of CLIP Model Distillation [24.52910358842176]
This paper aims to distill small CLIP models supervised by a large teacher CLIP model.
We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well.
Interactive contrastive learning across the teacher and student encoders is also effective for improving performance.
arXiv Detail & Related papers (2023-07-24T12:24:07Z)
- CrossKD: Cross-Head Knowledge Distillation for Object Detection [69.16346256926842]
Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors.
We present a prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head.
Our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods.
arXiv Detail & Related papers (2023-06-20T08:19:51Z)
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The light-weight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- Rethinking Knowledge Distillation via Cross-Entropy [23.46801498161629]
We try to decompose the KD loss to explore its relation with the CE loss.
We find it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss.
In training without teachers, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48% accuracy, which are 0.83%, 0.86%, and 0.30% higher than their baselines, respectively.
arXiv Detail & Related papers (2022-08-22T08:32:08Z)
- ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing [25.49637460661711]
Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle the problem of noisy supervision from such a pretrained model via an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
arXiv Detail & Related papers (2022-02-07T14:53:22Z)
- A Fast Knowledge Distillation Framework for Visual Recognition [17.971973892352864]
The Fast Knowledge Distillation (FKD) framework replicates the distillation training phase and generates soft labels using a multi-crop KD approach.
Our FKD is even more efficient than the traditional image classification framework.
arXiv Detail & Related papers (2021-12-02T18:59:58Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student that distills from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Learning from a Lightweight Teacher for Efficient Knowledge Distillation [14.865673786025525]
This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset whose adjustable class number is set equal to that of the target dataset.
The teacher then generates soft targets with which an enhanced KD loss guides student learning; this loss combines a KD loss with an adversarial loss that pushes the student's output to be indistinguishable from the teacher's, as sketched below.
arXiv Detail & Related papers (2020-05-19T01:54:15Z)
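To make the LW-KD summary above concrete, the sketch below shows one way a KD term can be combined with an adversarial term so that the student's outputs become hard to distinguish from the teacher's. The module names, loss weights, and discriminator design are assumptions for illustration; this is not the paper's implementation.

```python
# Illustrative sketch of a KD loss combined with an adversarial loss:
# a discriminator scores outputs as teacher-like, and the student is
# trained to fool it while also matching the teacher's soft labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard soft-label KD term (KL divergence at temperature T)."""
    s = F.log_softmax(student_logits / temperature, dim=1)
    t = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


class OutputDiscriminator(nn.Module):
    """Tiny MLP that scores a softmax output as teacher-like (1) or not (0)."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, probs):
        return self.net(probs)


def student_step(student_logits, teacher_logits, disc, alpha=1.0, beta=0.1):
    """Student objective: KD term plus an adversarial term that rewards the
    student for producing outputs the discriminator labels as teacher-like."""
    s_probs = F.softmax(student_logits, dim=1)
    adv = F.binary_cross_entropy_with_logits(
        disc(s_probs), torch.ones(s_probs.size(0), 1, device=s_probs.device)
    )
    return alpha * kd_loss(student_logits, teacher_logits) + beta * adv
```

In a full training loop, the discriminator would alternately be trained to separate teacher outputs (label 1) from student outputs (label 0).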
This list is automatically generated from the titles and abstracts of the papers on this site.