Knowledge Distillation Thrives on Data Augmentation
- URL: http://arxiv.org/abs/2012.02909v1
- Date: Sat, 5 Dec 2020 00:32:04 GMT
- Title: Knowledge Distillation Thrives on Data Augmentation
- Authors: Huan Wang, Suhas Lohit, Michael Jones, Yun Fu
- Abstract summary: Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
- Score: 65.58705111863814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model. Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far. In this paper, we are motivated by an interesting observation in classification: the KD loss can benefit from extended training iterations while the cross-entropy loss does not. We show this disparity arises because of data augmentation: the KD loss can tap into the extra information from different input views brought by DA. Based on this explanation, we propose to enhance KD via a stronger data augmentation scheme (e.g., mixup, CutMix). Furthermore, an even stronger new DA approach is developed specifically for KD based on the idea of active learning. The findings and merits of the proposed method are validated by extensive experiments on the CIFAR-100, Tiny ImageNet, and ImageNet datasets. We can achieve improved performance simply by using the original KD loss combined with stronger augmentation schemes, compared to existing state-of-the-art methods, which employ more advanced distillation losses. In addition, when our approaches are combined with more advanced distillation losses, we can advance the state-of-the-art performance even further. On top of the encouraging performance, this paper also sheds some light on explaining the success of knowledge distillation. The discovered interplay between KD and DA may inspire more advanced KD algorithms.
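The recipe the abstract describes is deliberately simple: keep the vanilla KD loss and feed both teacher and student the same, more aggressively augmented view of each image. Below is a minimal PyTorch-style sketch of that combination using CutMix as the stronger augmentation; the temperature, loss weights, and model handles (`teacher`, `student`) are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def cutmix(x, y, alpha=1.0):
    """Apply CutMix: paste a random patch from a shuffled batch into x."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    h, w = x.shape[2:]
    rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    x = x.clone()
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # correct lambda for the actual patch area
    return x, y, y[idx], lam

def kd_step(student, teacher, x, y, T=4.0, alpha=0.9):
    """One training step: KD loss on a CutMix view plus a small CE term."""
    x_aug, y_a, y_b, lam = cutmix(x, y)
    with torch.no_grad():
        t_logits = teacher(x_aug)  # teacher sees the same augmented view as the student
    s_logits = student(x_aug)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    ce = lam * F.cross_entropy(s_logits, y_a) + (1 - lam) * F.cross_entropy(s_logits, y_b)
    return alpha * kd + (1 - alpha) * ce
```

The same structure applies to mixup by replacing the patch pasting with a convex combination of the two images.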
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT), achieves improved performance compared to state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z)
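For the AEKT entry above, the key ingredient is the teacher's probability distribution over the non-target classes. The sketch below is a hedged illustration of that ingredient only: it drops each sample's target class from both logit vectors and matches the remaining, renormalized distributions. The decomposition and temperature are assumptions for illustration, not AEKT's actual adaptive loss.

```python
import torch
import torch.nn.functional as F

def drop_target_column(logits, labels):
    """Keep, for each sample, the logits of all classes except its target class."""
    n, c = logits.shape
    keep = torch.ones_like(logits, dtype=torch.bool)
    keep[torch.arange(n, device=logits.device), labels] = False
    return logits[keep].view(n, c - 1)

def non_target_kd(s_logits, t_logits, labels, T=4.0):
    """KL between teacher and student distributions over non-target classes only."""
    s_nt = F.log_softmax(drop_target_column(s_logits, labels) / T, dim=1)
    t_nt = F.softmax(drop_target_column(t_logits, labels) / T, dim=1)
    return F.kl_div(s_nt, t_nt, reduction="batchmean") * T * T
```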
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
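The RDD entry builds on pixel-level KD for segmentation. The sketch below shows the generic per-pixel KL objective such methods start from; the optional `pixel_weight` argument is only a placeholder where a difficulty-based weighting could go and is not RDD's actual formulation.

```python
import torch.nn.functional as F

def pixelwise_kd(s_logits, t_logits, T=2.0, pixel_weight=None):
    """Per-pixel KD for segmentation: KL between class distributions at every pixel.

    s_logits, t_logits: (N, C, H, W). pixel_weight (N, H, W) stands in for a
    difficulty-based weighting; uniform weighting recovers plain pixel-wise KD.
    """
    s_logp = F.log_softmax(s_logits / T, dim=1)
    t_prob = F.softmax(t_logits / T, dim=1)
    kl = (t_prob * (t_prob.clamp_min(1e-8).log() - s_logp)).sum(dim=1)  # (N, H, W)
    if pixel_weight is not None:
        kl = kl * pixel_weight
    return kl.mean() * T * T
```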
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
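For the ATKD entry, the baseline being improved is token-level KD for autoregressive language models: a per-token KL between teacher and student next-token distributions, masked to real tokens. The sketch below shows only that baseline; ATKD's adaptive teaching on top of it is not reproduced here.

```python
import torch.nn.functional as F

def token_level_kd(s_logits, t_logits, attention_mask, T=1.0):
    """Baseline KD for autoregressive LMs: per-token KL over the vocabulary.

    s_logits, t_logits: (batch, seq_len, vocab).
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    s_logp = F.log_softmax(s_logits / T, dim=-1)
    t_prob = F.softmax(t_logits / T, dim=-1)
    kl = (t_prob * (t_prob.clamp_min(1e-8).log() - s_logp)).sum(dim=-1)  # (batch, seq_len)
    kl = (kl * attention_mask).sum() / attention_mask.sum()
    return kl * T * T
```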
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
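The R2KD entry replaces the usual KL divergence with a correlation distance. A minimal sketch of one such distance, 1 minus the Pearson correlation between the softened teacher and student distributions, is given below; the exact distance R2KD uses, and its network-pruning component, are not reproduced.

```python
import torch.nn.functional as F

def correlation_distance_kd(s_logits, t_logits, T=4.0):
    """Distillation loss based on a correlation distance (illustrative).

    Computes 1 - Pearson correlation between the softened distributions per sample.
    """
    s = F.softmax(s_logits / T, dim=1)
    t = F.softmax(t_logits / T, dim=1)
    s = s - s.mean(dim=1, keepdim=True)
    t = t - t.mean(dim=1, keepdim=True)
    corr = (s * t).sum(dim=1) / (s.norm(dim=1) * t.norm(dim=1) + 1e-8)
    return (1 - corr).mean()
```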
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
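The CKD entry encourages the student to capture differences between the teacher's interpretations of samples. One plausible reading, sketched below purely as an illustration and not CKD's published loss, is to align pairwise differences of student outputs with the corresponding differences of precomputed teacher outputs.

```python
import torch
import torch.nn.functional as F

def comparative_kd(s_logits, t_logits):
    """Align student pairwise output differences with the teacher's (illustrative).

    t_logits can be precomputed offline, so the teacher is not queried at train time.
    """
    idx = torch.randperm(s_logits.size(0), device=s_logits.device)
    s_diff = F.log_softmax(s_logits, dim=1) - F.log_softmax(s_logits[idx], dim=1)
    t_diff = F.log_softmax(t_logits, dim=1) - F.log_softmax(t_logits[idx], dim=1)
    return F.mse_loss(s_diff, t_diff)
```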
- CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation [30.56389761245621]
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models.
Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training.
We propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA.
arXiv Detail & Related papers (2022-04-15T23:16:37Z)
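CILDA combines data augmentation with intermediate layer distillation. The fragment below sketches only the intermediate-layer part in its most common form, a projected MSE between hidden states; CILDA's contrastive objective and learned augmentations are not shown, and the hidden dimensions are made-up example values.

```python
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLayerKD(nn.Module):
    """MSE between a projected student hidden state and a teacher hidden state."""

    def __init__(self, student_dim=384, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # bridge the width mismatch

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, seq_len, student_dim); teacher_hidden: (batch, seq_len, teacher_dim)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```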
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
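The RKD summary is concrete enough for a small sketch: the student mimics the teacher's feature maps, and an assistant network is trained on the residual the student leaves behind. The feature shapes and the way the assistant consumes the student's features below are assumptions for illustration, not the paper's exact design.

```python
import torch.nn.functional as F

def residual_kd_losses(student_feat, teacher_feat, assistant):
    """Residual KD (illustrative): S mimics T's features, A predicts the leftover error.

    student_feat, teacher_feat: feature maps of matching shape, e.g. (N, C, H, W).
    assistant: a small network mapping student features to a residual estimate.
    """
    # Student objective: match the teacher's feature maps directly.
    student_loss = F.mse_loss(student_feat, teacher_feat)
    # Assistant objective: learn the residual the student fails to capture.
    residual = (teacher_feat - student_feat).detach()
    assistant_loss = F.mse_loss(assistant(student_feat.detach()), residual)
    return student_loss, assistant_loss
```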
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.