Knowledge Distillation with Deep Supervision
- URL: http://arxiv.org/abs/2202.07846v2
- Date: Thu, 25 May 2023 14:07:50 GMT
- Title: Knowledge Distillation with Deep Supervision
- Authors: Shiya Luo, Defang Chen, Can Wang
- Abstract summary: We propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers.
A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance.
- Score: 6.8080936803807734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation aims to enhance the performance of a lightweight
student model by exploiting the knowledge from a pre-trained cumbersome teacher
model. However, in the traditional knowledge distillation, teacher predictions
are only used to provide the supervisory signal for the last layer of the
student model, which may result in those shallow student layers lacking
accurate training guidance in the layer-by-layer back propagation and thus
hinders effective knowledge transfer. To address this issue, we propose
Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class
predictions and feature maps of the teacher model to supervise the training of
shallow student layers. A loss-based weight allocation strategy is developed in
DSKD to adaptively balance the learning process of each shallow layer, so as to
further improve the student performance. Extensive experiments on CIFAR-100 and
TinyImageNet with various teacher-student models show significant performance
improvements, confirming the effectiveness of our proposed method. Code is
available at:
https://github.com/luoshiya/DSKD
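To make the idea concrete, here is a minimal PyTorch sketch of deep supervision for distillation: auxiliary heads on shallow student stages are trained against the teacher's soft predictions, and the per-stage losses are combined with loss-based weights. The softmax-of-losses weighting and all names below are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' code): each shallow student stage has an
# auxiliary classifier head supervised by the teacher's soft predictions, and
# the per-stage KD losses are re-weighted by their relative magnitude
# (a simple stand-in for the paper's loss-based weight allocation).

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard KL-based distillation loss with temperature T."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def deeply_supervised_kd_loss(stage_logits, final_logits, teacher_logits,
                              labels, alpha=0.5):
    """stage_logits: list of logits from auxiliary heads on shallow stages."""
    # Per-stage distillation losses against the (fixed) teacher predictions.
    stage_losses = torch.stack([kd_loss(s, teacher_logits) for s in stage_logits])

    # Loss-based weights: stages with larger loss get larger weight, so harder
    # (typically shallower) stages receive stronger guidance. Illustrative only.
    weights = F.softmax(stage_losses.detach(), dim=0)
    deep_supervision = (weights * stage_losses).sum()

    # Usual KD objective on the final student output.
    ce = F.cross_entropy(final_logits, labels)
    final_kd = kd_loss(final_logits, teacher_logits)
    return alpha * ce + (1 - alpha) * final_kd + deep_supervision

# Toy usage with random tensors standing in for network outputs.
if __name__ == "__main__":
    B, C = 8, 100
    stage_logits = [torch.randn(B, C, requires_grad=True) for _ in range(3)]
    final_logits = torch.randn(B, C, requires_grad=True)
    teacher_logits = torch.randn(B, C)
    labels = torch.randint(0, C, (B,))
    loss = deeply_supervised_kd_loss(stage_logits, final_logits, teacher_logits, labels)
    loss.backward()
    print(float(loss))
```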
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
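As a rough illustration of the interleaved sampling described above, the sketch below has the student propose each token and falls back to the teacher whenever the proposal ranks poorly under the teacher; the top-k acceptance rule and all names are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (assumed details): the student proposes the next token;
# if that token is ranked poorly under the teacher's distribution (here:
# outside the teacher's top-k), it is replaced by a token sampled from the
# teacher. The resulting sequence can then serve as on-the-fly training data.

def interleaved_generate(student_logits_fn, teacher_logits_fn, prefix,
                         steps=16, top_k=20):
    """prefix: 1-D LongTensor of token ids; *_logits_fn(seq) -> (vocab,) logits."""
    seq = prefix.clone()
    for _ in range(steps):
        s_logits = student_logits_fn(seq)
        t_logits = teacher_logits_fn(seq)
        proposal = torch.multinomial(F.softmax(s_logits, dim=-1), 1)  # student proposes
        teacher_topk = torch.topk(t_logits, top_k).indices
        if proposal.item() not in teacher_topk.tolist():
            # Poorly ranked under the teacher: resample from the teacher instead.
            proposal = torch.multinomial(F.softmax(t_logits, dim=-1), 1)
        seq = torch.cat([seq, proposal])
    return seq

# Toy usage: random "language models" over a vocabulary of 1000 tokens.
if __name__ == "__main__":
    vocab = 1000
    student_logits_fn = lambda seq: torch.randn(vocab)
    teacher_logits_fn = lambda seq: torch.randn(vocab)
    prefix = torch.randint(0, vocab, (4,))
    print(interleaved_generate(student_logits_fn, teacher_logits_fn, prefix))
```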
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
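A heavily hedged sketch of the online-module idea summarized here: a small trainable head on top of (stand-in) frozen teacher features is updated jointly with the student, so the supervisory signal can adapt during training; module sizes, losses, and wiring are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the general idea only (details are assumptions): the pre-trained
# teacher backbone stays frozen, but a small "online" head on top of its
# features is trained jointly with the student.

class OnlineModule(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, num_classes))

    def forward(self, teacher_feats):
        return self.head(teacher_feats)

def okd_style_step(student, online_module, teacher_feats, x, y, optimizer, T=4.0):
    s_logits = student(x)
    t_logits = online_module(teacher_feats)          # adaptive teacher signal
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    # Both the student and the online module are trained with cross-entropy;
    # only the student additionally mimics the (detached) online-module output.
    loss = F.cross_entropy(s_logits, y) + F.cross_entropy(t_logits, y) + kd
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy usage with random features standing in for frozen-teacher activations.
if __name__ == "__main__":
    student = nn.Linear(32, 10)
    online = OnlineModule(64, 10)
    opt = torch.optim.SGD(list(student.parameters()) + list(online.parameters()), lr=0.1)
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    teacher_feats = torch.randn(8, 64)
    print(okd_style_step(student, online, teacher_feats, x, y, opt))
```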
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Data Upcycling Knowledge Distillation for Image Super-Resolution [25.753554952896096]
Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from pre-trained teacher models to compact student models.
We present the Data Upcycling Knowledge Distillation (DUKD) to transfer the teacher model's knowledge to the student model through the upcycled in-domain data derived from training data.
arXiv Detail & Related papers (2023-09-25T14:13:26Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
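A hedged sketch of the prior-knowledge idea summarized here: a random spatial mask injects teacher features into the projected student features before an L2 feature-distillation loss; the mixing scheme and ratio are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the general idea only (mixing scheme and ratio are assumptions):
# a portion of the teacher's feature map is injected into the student's
# projected features as "prior knowledge" before the feature-distillation loss.

def dpk_style_feature_loss(student_feat, teacher_feat, projector, prior_ratio=0.5):
    """student_feat: (B, Cs, H, W), teacher_feat: (B, Ct, H, W)."""
    s_proj = projector(student_feat)                      # align channels to teacher
    # Randomly choose spatial positions whose features are taken from the teacher.
    mask = (torch.rand(s_proj.shape[0], 1, *s_proj.shape[2:],
                       device=s_proj.device) < prior_ratio).float()
    mixed = mask * teacher_feat + (1.0 - mask) * s_proj   # teacher prior injected
    return F.mse_loss(mixed, teacher_feat)

# Toy usage with random feature maps.
if __name__ == "__main__":
    student_feat = torch.randn(4, 64, 8, 8, requires_grad=True)
    teacher_feat = torch.randn(4, 128, 8, 8)
    projector = nn.Conv2d(64, 128, kernel_size=1)
    loss = dpk_style_feature_loss(student_feat, teacher_feat, projector)
    loss.backward()
    print(float(loss))
```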
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
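A minimal sketch of one plausible reading of the rectification step: wherever the self-teacher's argmax is wrong, the probabilities of the predicted and true classes are swapped before distillation; this swap rule is an assumption about what "rectified" means, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Sketch only (the exact rectification rule is an assumption): before distilling
# the self-teacher's soft targets, any sample on which the self-teacher predicts
# the wrong class has the probabilities of its predicted class and the true
# class swapped, so the distilled target always ranks the correct class first.

def rectify_soft_targets(teacher_probs, labels):
    probs = teacher_probs.clone()
    pred = probs.argmax(dim=1)
    idx = torch.nonzero(pred != labels).squeeze(1)   # wrongly predicted samples
    rows = probs[idx]
    p_pred = rows.gather(1, pred[idx].unsqueeze(1))
    p_true = rows.gather(1, labels[idx].unsqueeze(1))
    rows.scatter_(1, pred[idx].unsqueeze(1), p_true)
    rows.scatter_(1, labels[idx].unsqueeze(1), p_pred)
    probs[idx] = rows
    return probs

def drkd_style_loss(student_logits, self_teacher_logits, labels, T=4.0, alpha=0.5):
    t_probs = F.softmax(self_teacher_logits / T, dim=1)
    t_probs = rectify_soft_targets(t_probs, labels)               # fix wrong targets
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1), t_probs,
                  reduction="batchmean") * T * T
    return alpha * F.cross_entropy(student_logits, labels) + (1 - alpha) * kd

# Toy usage.
if __name__ == "__main__":
    logits_s = torch.randn(8, 10, requires_grad=True)
    logits_t = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    loss = drkd_style_loss(logits_s, logits_t, labels)
    loss.backward()
    print(float(loss))
```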
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
- Semi-Online Knowledge Distillation [2.373824287636486]
Conventional knowledge distillation (KD) transfers knowledge from a large, well pre-trained teacher network to a small student network.
Deep mutual learning (DML) has been proposed to help student networks learn collaboratively and simultaneously.
We propose a Semi-Online Knowledge Distillation (SOKD) method that effectively improves the performance of the student and the teacher.
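For the deep mutual learning component mentioned above, here is a compact sketch of the standard two-student mutual-KL objective (plain DML, not SOKD itself); everything beyond the mutual KL term is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Compact sketch of plain deep mutual learning (DML), not of SOKD itself:
# two student networks are trained together, each matching the other's
# (detached) predictive distribution in addition to the usual CE loss.

def mutual_step(net_a, net_b, opt_a, opt_b, x, y):
    logits_a, logits_b = net_a(x), net_b(x)

    kl_a = F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=1),
                    F.softmax(logits_a.detach(), dim=1), reduction="batchmean")

    loss_a = F.cross_entropy(logits_a, y) + kl_a
    loss_b = F.cross_entropy(logits_b, y) + kl_b

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()

# Toy usage with two small classifiers.
if __name__ == "__main__":
    net_a, net_b = nn.Linear(32, 10), nn.Linear(32, 10)
    opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
    opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    print(mutual_step(net_a, net_b, opt_a, opt_b, x, y))
```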
arXiv Detail & Related papers (2021-11-23T09:44:58Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft target at each training step for a certain student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Boosting Light-Weight Depth Estimation Via Knowledge Distillation [21.93879961636064]
We propose a lightweight network that can accurately estimate depth maps using minimal computing resources.
We achieve this by designing a compact model architecture that maximally reduces model complexity.
Our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters.
arXiv Detail & Related papers (2021-05-13T08:42:42Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
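A hedged sketch of one incremental-feeding schedule: the teacher's logits are scaled by a factor that grows toward 1 over training, so the soft targets start out easy and become progressively richer; the paper's exact schedule and its two-phase training may differ.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of an annealing schedule (the paper's exact schedule and its
# two-phase training may differ): the teacher's logits are scaled by a factor
# that grows linearly from near 0 to 1 over training, so the targets start out
# soft and become progressively sharper.

def annealing_factor(epoch, max_epochs):
    return (epoch + 1) / max_epochs   # grows from 1/max_epochs to 1.0

def annealing_kd_loss(student_logits, teacher_logits, epoch, max_epochs):
    scaled_teacher = annealing_factor(epoch, max_epochs) * teacher_logits
    # Regress the student's logits toward the annealed teacher logits.
    return F.mse_loss(student_logits, scaled_teacher)

# Toy usage showing the target getting "harder" over epochs.
if __name__ == "__main__":
    student_logits = torch.randn(8, 100)
    teacher_logits = 5.0 * torch.randn(8, 100)
    for epoch in (0, 4, 9):
        loss = annealing_kd_loss(student_logits, teacher_logits, epoch, max_epochs=10)
        print(epoch, float(loss))
```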
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.