KDExplainer: A Task-oriented Attention Model for Explaining Knowledge
Distillation
- URL: http://arxiv.org/abs/2105.04181v2
- Date: Wed, 12 May 2021 11:54:17 GMT
- Title: KDExplainer: A Task-oriented Attention Model for Explaining Knowledge
Distillation
- Authors: Mengqi Xue, Jie Song, Xinchao Wang, Ying Chen, Xingen Wang, Mingli
Song
- Abstract summary: We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
- Score: 59.061835562314066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has recently emerged as an efficacious scheme for
learning compact deep neural networks (DNNs). Despite the promising results
achieved, the rationale that explains the behavior of KD has remained largely
understudied. In this paper, we introduce a novel task-oriented attention
model, termed KDExplainer, to shed light on the working mechanism underlying
vanilla KD. At the heart of KDExplainer is a Hierarchical Mixture of Experts
(HME), in which multi-class classification is reformulated as a multi-task
binary classification problem. Through distilling knowledge from a free-form
pre-trained DNN to KDExplainer, we observe that KD implicitly modulates the
knowledge conflicts between different subtasks, and in reality has much more to
offer than label smoothing. Based on these findings, we further introduce a
portable tool, dubbed the virtual attention module (VAM), which can be seamlessly
integrated with various DNNs to enhance their performance under KD.
Experimental results demonstrate that with a negligible additional cost,
student models equipped with VAM consistently outperform their non-VAM
counterparts across different benchmarks. Furthermore, when combined with other
KD methods, VAM remains effective in improving results, even though it is only
motivated by vanilla KD. The code is available at
https://github.com/zju-vipa/KDExplainer.
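For readers unfamiliar with the objective being analyzed, the sketch below gives the standard vanilla KD loss (hard-label cross-entropy plus temperature-softened KL against the teacher) together with a toy one-vs-rest binary reformulation loosely in the spirit of the HME described above. It is a generic PyTorch illustration with assumed hyperparameters, not code from the KDExplainer repository.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard vanilla KD objective: hard-label cross-entropy plus the
    temperature-softened KL divergence to the teacher. T and alpha are
    illustrative hyperparameters, not the paper's settings."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T   # T^2 keeps gradient scales comparable
    return (1.0 - alpha) * ce + alpha * kd

def one_vs_rest_binary_loss(logits, labels):
    """Toy illustration of recasting K-class classification as K binary
    (one-vs-rest) subtasks, loosely in the spirit of the HME reformulation;
    the actual KDExplainer gates hierarchical experts rather than a single head."""
    targets = F.one_hot(labels, num_classes=logits.size(1)).float()
    return F.binary_cross_entropy_with_logits(logits, targets)

# Example usage with random tensors standing in for real model outputs.
student_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = vanilla_kd_loss(student_logits, teacher_logits, labels) \
       + one_vs_rest_binary_loss(student_logits, labels)
```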
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
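As a rough illustration of the interleaved sampling described above (not the SKD authors' actual acceptance rule), the sketch below lets the student propose each next token and has the teacher replace proposals that fall outside its own top-k. It assumes student and teacher are callables mapping token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab); the top_k threshold and replacement rule are illustrative.

```python
import torch

@torch.no_grad()
def interleaved_sampling(student, teacher, prompt_ids, max_new_tokens=32, top_k=25):
    """Hypothetical sketch: the student proposes tokens, the teacher overrides
    proposals it ranks poorly (outside its top-k) with a sample from its own
    distribution. The acceptance criterion is an assumption for illustration."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        s_logits = student(ids)[:, -1, :]            # student's next-token logits
        proposal = s_logits.argmax(dim=-1)           # student's proposed token
        t_logits = teacher(ids)[:, -1, :]            # teacher scores the same prefix
        # Rank of each proposal under the teacher: how many tokens it prefers more.
        rank = (t_logits > t_logits.gather(-1, proposal[:, None])).sum(dim=-1)
        teacher_pick = torch.multinomial(t_logits.softmax(dim=-1), 1).squeeze(-1)
        next_tok = torch.where(rank >= top_k, teacher_pick, proposal)
        ids = torch.cat([ids, next_tok[:, None]], dim=1)
    return ids
```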
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
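One way to read the comparative idea above is that the student reproduces the teacher's differences between samples (computable from cached teacher outputs) rather than matching the teacher sample-by-sample. The sketch below is only that interpretation with illustrative names, not the CKD paper's actual objective.

```python
import torch
import torch.nn.functional as F

def comparative_kd_loss(student_feat_a, student_feat_b, teacher_feat_a, teacher_feat_b):
    """Hedged sketch of 'comparative' distillation: align the student's
    between-sample difference with the teacher's, so pre-computed teacher
    features suffice and no live teacher queries are needed."""
    return F.mse_loss(student_feat_a - student_feat_b,
                      teacher_feat_a - teacher_feat_b)

# Example with random features standing in for real penultimate-layer activations.
sa, sb = torch.randn(8, 128), torch.randn(8, 128)
ta, tb = torch.randn(8, 128), torch.randn(8, 128)
loss = comparative_kd_loss(sa, sb, ta, tb)
```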
- Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation [10.899753512019933]
Knowledge Distillation (KD) aims to optimize a lightweight network.
KD mainly involves knowledge extraction and distillation strategies.
This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms.
arXiv Detail & Related papers (2023-06-19T03:42:44Z)
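The three knowledge categories named in the title above are commonly illustrated with three loss families; the sketch below gives one textbook-style instance of each (softened-logit matching, intermediate-feature matching with a learned projection, and pairwise-relation matching). It is a generic illustration, not code from the survey.

```python
import torch
import torch.nn.functional as F

def response_based_loss(student_logits, teacher_logits, T=4.0):
    """Response-based KD: match the teacher's temperature-softened outputs."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T

def feature_based_loss(student_feat, teacher_feat, proj):
    """Feature-based KD: match intermediate activations, projecting the
    student's (narrower) features to the teacher's width."""
    return F.mse_loss(proj(student_feat), teacher_feat)

def relation_based_loss(student_feat, teacher_feat):
    """Relation-based KD: match the pairwise sample-to-sample similarity structure."""
    s_rel = F.normalize(student_feat @ student_feat.t(), dim=1)
    t_rel = F.normalize(teacher_feat @ teacher_feat.t(), dim=1)
    return F.mse_loss(s_rel, t_rel)

# Illustrative shapes: batch of 8, student width 64, teacher width 128.
proj = torch.nn.Linear(64, 128)
loss = (response_based_loss(torch.randn(8, 10), torch.randn(8, 10))
        + feature_based_loss(torch.randn(8, 64), torch.randn(8, 128), proj)
        + relation_based_loss(torch.randn(8, 64), torch.randn(8, 128)))
```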
- Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization [29.113990037893597]
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (the student's) performance by transferring knowledge from a larger model (the teacher).
Existing KD techniques do not mitigate noise in the teacher's output: noisy behaviour distracts the student from learning the more useful parts of the teacher's knowledge.
We propose a new KD method that addresses these problems.
arXiv Detail & Related papers (2022-12-12T16:00:20Z)
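In general, continuation optimization solves a smoothed version of a problem first and gradually recovers the original objective. Purely as a loose illustration of that idea (not the paper's specific formulation), the sketch below anneals the weight on the potentially noisy teacher term toward the hard-label term over training.

```python
import torch.nn.functional as F

def continuation_style_kd_loss(student_logits, teacher_logits, labels,
                               step, total_steps, T=2.0):
    """Generic continuation-style schedule: rely heavily on the smoothed teacher
    signal early in training and shift toward the original hard-label objective
    later. The linear schedule and temperature are assumptions for illustration."""
    alpha = max(0.0, 1.0 - step / total_steps)     # teacher weight decays to zero
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * kd + (1.0 - alpha) * ce
```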
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
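The summary above names the mechanism explicitly: align the student's and the teacher's gradients with respect to the input, on top of the usual KD terms. The sketch below shows only that alignment penalty; the choice of per-model loss (cross-entropy) and the squared-error penalty are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def input_gradient_alignment_loss(student, teacher, x, labels):
    """Hedged sketch of an input-gradient-alignment penalty: compute each
    model's loss gradient w.r.t. the input and penalize their mismatch."""
    x = x.clone().requires_grad_(True)
    s_grad = torch.autograd.grad(F.cross_entropy(student(x), labels), x,
                                 create_graph=True)[0]   # keep graph for backprop
    t_grad = torch.autograd.grad(F.cross_entropy(teacher(x), labels), x)[0]
    return F.mse_loss(s_grad, t_grad.detach())

# Example with tiny linear stand-ins for real image classifiers.
student, teacher = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
penalty = input_gradient_alignment_loss(student, teacher,
                                        torch.randn(8, 4), torch.randint(0, 3, (8,)))
```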
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student model that tries to distill knowledge from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
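A common reading of "self-undermining knowledge distillation" is that the nasty teacher is trained to stay accurate while pushing its output distribution away from that of a normally trained reference network, so that students who imitate it are misled. The sketch below encodes that reading; the sign convention, weight, and temperature are illustrative assumptions.

```python
import torch.nn.functional as F

def self_undermining_loss(nasty_logits, reference_logits, labels, T=4.0, omega=0.04):
    """Hedged sketch: keep the nasty teacher accurate (cross-entropy) while
    maximizing its KL divergence from a normally trained reference network,
    so its soft labels carry misleading dark knowledge. omega and T are
    illustrative, not the paper's values."""
    ce = F.cross_entropy(nasty_logits, labels)
    kl = F.kl_div(F.log_softmax(nasty_logits / T, dim=1),
                  F.softmax(reference_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return ce - omega * kl    # minus sign: push the outputs *away* from the reference
```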
- Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression [17.12028267150745]
We propose a unified KD framework based on conditional generative adversarial networks (cGANs).
cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples.
Experiments on CIFAR-10 and Tiny-ImageNet show we can incorporate KD methods into the cGAN-KD framework to reach a new state of the art.
arXiv Detail & Related papers (2021-04-07T14:52:49Z)
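The mechanism in the summary above (distilling via cGAN-generated samples) can be sketched as: sample class labels and latent codes, synthesize a batch with a conditional generator, and apply an ordinary KD loss on that synthetic batch. The generator signature generator(z, y) and all hyperparameters below are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_step_on_cgan_samples(generator, teacher, student, batch_size=64,
                            n_classes=10, latent_dim=128, T=4.0):
    """Hedged sketch of distillation on cGAN-generated data: draw labels and
    latents, synthesize a batch, and match the teacher on it. The conditional
    generator interface generator(z, y) is assumed for illustration."""
    y = torch.randint(0, n_classes, (batch_size,))
    z = torch.randn(batch_size, latent_dim)
    with torch.no_grad():
        fake = generator(z, y)        # synthetic samples conditioned on labels y
        t_logits = teacher(fake)
    s_logits = student(fake)
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T
```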
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
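The observation above (that the KD loss keeps benefiting from longer training while cross-entropy does not, because DA keeps supplying new input views) can be illustrated with a minimal training step in which every batch is freshly augmented before both networks see it. The toy augmentation and hyperparameters below are illustrative, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def random_view(images):
    """Toy augmentation (random horizontal flip plus small noise); a stand-in
    for a real DA pipeline such as random crops, flips, or CutMix."""
    if torch.rand(()) < 0.5:
        images = torch.flip(images, dims=[-1])
    return images + 0.01 * torch.randn_like(images)

def kd_train_step(student, teacher, optimizer, images, labels, T=4.0, alpha=0.9):
    """One KD step where the same freshly augmented view feeds both networks,
    so the soft teacher targets differ every epoch even for repeated images."""
    views = random_view(images)
    with torch.no_grad():
        t_logits = teacher(views)
    s_logits = student(views)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    loss = alpha * kd + (1.0 - alpha) * F.cross_entropy(s_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```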
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.