Distilling Knowledge by Mimicking Features
- URL: http://arxiv.org/abs/2011.01424v2
- Date: Sat, 14 Aug 2021 01:38:50 GMT
- Title: Distilling Knowledge by Mimicking Features
- Authors: Guo-Hua Wang, Yifan Ge, Jianxin Wu
- Abstract summary: We argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer.
Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer.
- Score: 32.79431807764681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is a popular method to train efficient networks
("student") with the help of high-capacity networks ("teacher"). Traditional
methods use the teacher's soft logits as extra supervision to train the student
network. In this paper, we argue that it is more advantageous to make the
student mimic the teacher's features in the penultimate layer. Not only can the
student directly learn more effective information from the teacher's features,
but feature mimicking can also be applied to teachers trained without a softmax
layer. Experiments show that it can achieve higher accuracy than traditional
KD. To further facilitate feature mimicking, we decompose a feature vector into
the magnitude and the direction. We argue that the teacher should give more
freedom to the student feature's magnitude, and let the student pay more
attention to mimicking the feature direction. To meet this requirement, we
propose a loss term based on locality-sensitive hashing (LSH). With the help of
this new loss, our method indeed mimics feature directions more accurately,
relaxes constraints on feature magnitudes, and achieves state-of-the-art
distillation accuracy. We provide theoretical analyses of how LSH facilitates
feature direction mimicking, and further extend feature mimicking to
multi-label recognition and object detection.
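A minimal PyTorch sketch of the LSH-style direction-mimicking loss described above, under the assumption that the hashing hyperplanes are a fixed random projection and that the student's penultimate feature has already been mapped to the teacher's dimensionality; the hash count and loss weighting are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSHDirectionLoss(nn.Module):
    """Sketch of an LSH-style loss that pushes the student's penultimate
    feature to point in the same direction as the teacher's.

    A fixed random projection defines hashing hyperplanes (the classic LSH
    family for cosine similarity). The teacher feature is hashed into binary
    codes, and the student is trained to reproduce those codes, which
    constrains the feature direction far more than its magnitude."""

    def __init__(self, feat_dim, num_hashes=2048):
        super().__init__()
        # Random hyperplane normals, kept fixed during training (illustrative choice).
        self.register_buffer("proj", torch.randn(num_hashes, feat_dim))

    def forward(self, student_feat, teacher_feat):
        # Teacher hash codes: which side of each hyperplane the feature falls on.
        with torch.no_grad():
            target = (teacher_feat @ self.proj.t() > 0).float()
        # Student projections onto the same hyperplanes; BCE encourages landing
        # on the same side of every hyperplane, i.e. matching the direction.
        logits = student_feat @ self.proj.t()
        return F.binary_cross_entropy_with_logits(logits, target)

# Usage sketch (names are placeholders): combine with the usual task loss.
#   lsh_loss = LSHDirectionLoss(feat_dim=2048)
#   loss = ce_loss + lambda_lsh * lsh_loss(student_feat, teacher_feat)
# If the student's feature dimension differs from the teacher's, a small linear
# adapter on the student feature is assumed.
```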
Related papers
- Improving Knowledge Distillation with Teacher's Explanation [14.935696904019146]
We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
arXiv Detail & Related papers (2023-10-04T04:18:01Z) - Knowledge Distillation Layer that Lets the Student Decide [6.689381216751284]
We propose a learnable KD layer for the student, which improves KD with two distinct abilities:
i) learning how to leverage the teacher's knowledge, enabling the student to discard nuisance information, and ii) feeding the transferred knowledge forward to deeper layers.
arXiv Detail & Related papers (2023-09-06T09:05:03Z) - Improving Knowledge Distillation via Regularizing Feature Norm and
Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
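For reference, the two alignment objectives this summary names (KL divergence between logits and L2 distance between intermediate features) look roughly as follows; the temperature value and the assumption of shape-matched features are illustrative, not this paper's exact setup.

```python
import torch.nn.functional as F

def kd_losses(student_logits, teacher_logits, student_feat, teacher_feat, T=4.0):
    """The two standard alignment objectives mentioned above (illustrative):
    (1) KL divergence between temperature-softened logits (Hinton-style KD),
    (2) L2 distance between intermediate features, assumed shape-matched."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    l2 = F.mse_loss(student_feat, teacher_feat)
    return kl, l2
```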
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - PrUE: Distilling Knowledge from Sparse Teacher Networks [4.087221125836262]
We present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher.
We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Our method allows researchers to distill knowledge from deeper networks to improve students further.
arXiv Detail & Related papers (2022-07-03T08:14:24Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task by optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
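The coupling between teacher and student output distributions over different label spaces can be pictured with entropic optimal transport (Sinkhorn iterations); the cost matrix built from label semantics and the regularization strength below are assumptions for illustration, not this paper's exact formulation.

```python
import torch

def sinkhorn_coupling(cost, teacher_probs, student_probs, eps=0.1, n_iters=50):
    """Entropic-OT sketch: couple the teacher's and the student's output
    distributions over different label spaces, given a cost matrix derived
    from label semantics (e.g. distances between label embeddings).

    cost: (T, S) tensor; teacher_probs: (T,); student_probs: (S,), each summing to 1."""
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(teacher_probs)
    v = torch.ones_like(student_probs)
    for _ in range(n_iters):                     # alternating marginal scaling
        u = teacher_probs / (K @ v)
        v = student_probs / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # transport plan with the given marginals
```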
arXiv Detail & Related papers (2022-04-25T09:34:37Z) - Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, but significantly degrades the performance of any student model that tries to learn from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
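A hedged sketch of a self-undermining objective in the spirit described here: keep the teacher's own cross-entropy low while pushing its output distribution away (via KL divergence) from a normally trained counterpart, so the soft labels it produces carry misleading dark knowledge. The weight and temperature values are assumptions, not the paper's settings.

```python
import torch.nn.functional as F

def self_undermining_loss(nasty_logits, normal_logits, labels, omega=0.01, tau=4.0):
    """Sketch of a self-undermining objective (weight and temperature are
    illustrative): cross-entropy preserves the nasty teacher's own accuracy,
    while maximizing the KL divergence from a normally trained counterpart
    scrambles the soft-label structure a student would try to distill."""
    ce = F.cross_entropy(nasty_logits, labels)
    kl = F.kl_div(
        F.log_softmax(nasty_logits / tau, dim=1),
        F.softmax(normal_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau * tau)
    return ce - omega * kl  # minimize CE, maximize divergence from the normal model
```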
arXiv Detail & Related papers (2021-05-16T08:41:30Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of connection paths across levels between teacher and student networks, and reveal its great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our finally designed nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Locally Linear Region Knowledge Distillation [5.6592403195043826]
Knowledge distillation (KD) is an effective technique to transfer knowledge from one neural network (teacher) to another (student).
We argue that transferring knowledge at sparse training data points cannot enable the student to capture the local shape of the teacher function well.
We propose locally linear region knowledge distillation (L²RKD), which transfers the knowledge in local, linear regions from a teacher to a student.
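One way to read "transferring knowledge in local, linear regions" is to distill on points sampled between training inputs rather than only at the inputs themselves; the sketch below uses random convex combinations of batch pairs, which is an illustrative assumption rather than the paper's exact region construction.

```python
import torch
import torch.nn.functional as F

def local_region_kd_loss(student, teacher, x, n_samples=4):
    """Illustrative sketch: match teacher and student outputs on random convex
    combinations of input pairs, so the student sees the teacher's behaviour
    over a local region instead of only at isolated training points."""
    perm = torch.randperm(x.size(0), device=x.device)
    loss = x.new_zeros(())
    for _ in range(n_samples):
        # Per-example mixing coefficient, broadcast over all non-batch dims.
        lam = torch.rand(x.size(0), *([1] * (x.dim() - 1)), device=x.device)
        x_mix = lam * x + (1.0 - lam) * x[perm]
        with torch.no_grad():
            t_out = teacher(x_mix)
        loss = loss + F.mse_loss(student(x_mix), t_out)
    return loss / n_samples
```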
arXiv Detail & Related papers (2020-10-09T21:23:53Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
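The "swapping-in operation" can be pictured as occasionally routing an intermediate activation through the corresponding frozen teacher block during the student's forward pass; the block granularity, swap probability, and assumption of matching feature shapes are all illustrative here, not the paper's exact scheme.

```python
import random
import torch.nn as nn

class SwapInDistiller(nn.Module):
    """Sketch of a swapping-in interaction: during training, an intermediate
    activation is occasionally routed through the corresponding frozen teacher
    block instead of the student block, so later student stages learn to build
    on teacher-quality features. Assumes stage-wise matching feature shapes."""

    def __init__(self, student_blocks, teacher_blocks, swap_p=0.5):
        super().__init__()
        assert len(student_blocks) == len(teacher_blocks)
        self.student_blocks = nn.ModuleList(student_blocks)
        self.teacher_blocks = nn.ModuleList(teacher_blocks)
        for p in self.teacher_blocks.parameters():
            p.requires_grad_(False)  # the teacher stays frozen
        self.swap_p = swap_p

    def forward(self, x):
        for s_block, t_block in zip(self.student_blocks, self.teacher_blocks):
            if self.training and random.random() < self.swap_p:
                x = t_block(x)  # swap in the teacher block for this stage
            else:
                x = s_block(x)
        return x
```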
arXiv Detail & Related papers (2020-07-03T03:22:04Z) - Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)