Related papers: What is Lost in Knowledge Distillation?

URL: http://arxiv.org/abs/2311.04142v1
Date: Tue, 7 Nov 2023 17:13:40 GMT
Title: What is Lost in Knowledge Distillation?
Authors: Manas Mohanty, Tanya Roosta, Peyman Passban
Abstract summary: Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks can be costly. Model compression techniques such as knowledge distillation (KD) have been proposed to address the issue. Our work investigates how a distilled student model differs from its teacher, whether the distillation process causes any information loss, and whether that loss follows a specific pattern.
Score: 4.1205832766381985
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks can be costly. Model compression techniques such as knowledge distillation (KD) have been proposed to address the issue; however, the compression process can be lossy. Motivated by this, our work investigates how a distilled student model differs from its teacher, whether the distillation process causes any information loss, and whether that loss follows a specific pattern. Our experiments aim to shed light on which types of tasks might be more or less sensitive to KD by reporting data points on the contribution of different factors, such as the number of layers or attention heads. Results such as ours could be used when determining effective and efficient configurations to achieve optimal information transfer between larger (teacher) and smaller (student) models.

Related papers

Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation [29.552309706623138]
Knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We introduce a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet.
arXiv Detail & Related papers (2025-06-02T22:15:59Z)
Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting [2.5536554335016417]
Knowledge distillation (KD) stands out for its strong empirical performance. KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. This paper provides a mathematical rationale showing that in a simple KD setting when the loss is decreasing, the balancing parameter should be dynamically adjusted.
arXiv Detail & Related papers (2025-05-06T04:04:30Z)
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals. Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models. Most existing KD techniques rely on Kullback-Leibler (KL) divergence. We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature. We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression. In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher. We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer. Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD). We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy. Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
Preventing Catastrophic Forgetting and Distribution Mismatch in Knowledge Distillation via Synthetic Data [5.064036314529226]
We propose a data-free KD framework that maintains a dynamic collection of generated samples over time. Our experiments demonstrate that we can improve the accuracy of the student models obtained via KD when compared with state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T08:11:08Z)
Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently. This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model. Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far. In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not. We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A). In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them. Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.