ResKD: Residual-Guided Knowledge Distillation
- URL: http://arxiv.org/abs/2006.04719v4
- Date: Tue, 9 Mar 2021 03:35:20 GMT
- Title: ResKD: Residual-Guided Knowledge Distillation
- Authors: Xuewei Li, Songyuan Li, Bourahla Omar, Fei Wu, and Xi Li
- Abstract summary: We see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance to train a much more lightweight "res-student".
We combine the student and the res-student into a new student, where the res-student rectifies the errors of the former student.
We achieve competitive performance with 18.04%, 23.14%, 53.59%, and 56.86% of the teachers' computational costs.
- Score: 22.521831561264534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation, aimed at transferring the knowledge from a heavy
teacher network to a lightweight student network, has emerged as a promising
technique for compressing neural networks. However, due to the capacity gap
between the heavy teacher and the lightweight student, there still exists a
significant performance gap between them. In this paper, we see knowledge
distillation in a fresh light, using the knowledge gap, or the residual,
between a teacher and a student as guidance to train a much more lightweight
student, called a res-student. We combine the student and the res-student into
a new student, where the res-student rectifies the errors of the former
student. Such a residual-guided process can be repeated until the user strikes
the balance between accuracy and cost. At inference time, we propose a
sample-adaptive strategy to decide which res-students are not necessary for
each sample, which can save computational cost. Experimental results show that
we achieve competitive performance with 18.04$\%$, 23.14$\%$, 53.59$\%$, and
56.86$\%$ of the teachers' computational costs on the CIFAR-10, CIFAR-100,
Tiny-ImageNet, and ImageNet datasets. Finally, we provide a thorough theoretical
and empirical analysis of our method.
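To make the mechanism described in the abstract concrete, the following is a minimal PyTorch-style sketch of residual-guided distillation and the sample-adaptive inference strategy. It is a sketch under stated assumptions, not the authors' implementation: the function names (train_res_student_step, predict), the regression-style loss on the logit residual, and the confidence-based gating criterion are all illustrative choices.

```python
# Hypothetical sketch of residual-guided distillation (ResKD-style).
# Assumption: the res-student is trained to predict the residual between
# teacher and student logits, and a confidence-based gate decides at
# inference whether the res-student's correction is needed.
import torch
import torch.nn.functional as F


def train_res_student_step(teacher, student, res_student, x, optimizer):
    """One training step: fit the res-student to the teacher-student residual."""
    with torch.no_grad():
        t_logits = teacher(x)           # frozen, heavy teacher
        s_logits = student(x)           # frozen, already-trained student
        residual = t_logits - s_logits  # the knowledge gap to be learned

    r_logits = res_student(x)                # much more lightweight res-student
    loss = F.mse_loss(r_logits, residual)    # assumed regression-style objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def predict(student, res_student, x, confidence_threshold=0.9):
    """Sample-adaptive inference: skip the res-student when the base student
    is already confident enough (this gating criterion is an assumption)."""
    s_logits = student(x)
    confidence, _ = F.softmax(s_logits, dim=1).max(dim=1)
    needs_correction = confidence < confidence_threshold
    if needs_correction.any():
        # Run the res-student only on uncertain samples to save computation.
        correction = res_student(x[needs_correction])
        s_logits[needs_correction] = s_logits[needs_correction] + correction
    return s_logits.argmax(dim=1)
```

In this sketch the combined student is simply the sum of the student's and the res-student's logits, matching the description of the res-student rectifying the former student's errors; the same residual-guided step could, in principle, be repeated with further, even smaller res-students until the desired accuracy-cost balance is reached.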
Related papers
- Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a calibrated student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it suitable as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z) - Improving Ensemble Distillation With Weight Averaging and Diversifying
Perturbation [22.87106703794863]
It motivates distilling knowledge from the ensemble teacher into a smaller student network.
We propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers.
We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student.
arXiv Detail & Related papers (2022-06-30T06:23:03Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the confidence gap between the teacher and the student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm the student's performance if the student is forced to learn it.
arXiv Detail & Related papers (2020-10-15T03:03:36Z) - Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z) - Teacher-Class Network: A Neural Network Compression Mechanism [2.257416403770908]
Instead of transferring knowledge to one student only, the proposed method transfers a chunk of knowledge to each student.
Our students are not trained on problem-specific logits; instead, they are trained to mimic the knowledge (dense representation) learned by the teacher network.
The proposed teacher-class architecture is evaluated on several benchmark datasets such as MNIST, Fashion MNIST, IMDB Movie Reviews, CAMVid, CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-04-07T11:31:20Z) - Neural Networks Are More Productive Teachers Than Human Raters: Active
Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.