What is Lost in Knowledge Distillation?
- URL: http://arxiv.org/abs/2311.04142v1
- Date: Tue, 7 Nov 2023 17:13:40 GMT
- Title: What is Lost in Knowledge Distillation?
- Authors: Manas Mohanty, Tanya Roosta, Peyman Passban
- Abstract summary: Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks could be costly.
Model compression techniques, such as knowledge distillation (KD), have been proposed to address the issue.
Our work investigates how a distilled student model differs from its teacher, if the distillation process causes any information losses, and if the loss follows a specific pattern.
- Score: 4.1205832766381985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have improved NLP tasks significantly, but
training and maintaining such networks could be costly. Model compression
techniques, such as knowledge distillation (KD), have been proposed to address
the issue; however, the compression process could be lossy. Motivated by this,
our work investigates how a distilled student model differs from its teacher,
if the distillation process causes any information losses, and if the loss
follows a specific pattern. Our experiments aim to shed light on the types of
tasks that might be more or less sensitive to KD by reporting data points on the
contribution of different factors, such as the number of layers or attention
heads. Results such as ours could be utilized when determining effective and
efficient configurations to achieve optimal information transfers between
larger (teacher) and smaller (student) models.
Related papers
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method, dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Preventing Catastrophic Forgetting and Distribution Mismatch in Knowledge Distillation via Synthetic Data [5.064036314529226]
We propose a data-free KD framework that maintains a dynamic collection of generated samples over time.
Our experiments demonstrate that we can improve the accuracy of the student models obtained via KD when compared with state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T08:11:08Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)