Knowledge Distillation Layer that Lets the Student Decide
- URL: http://arxiv.org/abs/2309.02843v1
- Date: Wed, 6 Sep 2023 09:05:03 GMT
- Title: Knowledge Distillation Layer that Lets the Student Decide
- Authors: Ada Gorgun, Yeti Z. Gurbuz, A. Aydin Alatan
- Abstract summary: We propose a learnable KD layer for the student which improves KD with two distinct abilities:
i) learning how to leverage the teacher's knowledge, enabling it to discard nuisance information, and ii) feeding the transferred knowledge forward to deeper layers.
- Score: 6.689381216751284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical technique in knowledge distillation (KD) is to regularize
the learning of a limited-capacity model (the student) by pushing its responses
to match those of a powerful model (the teacher). Albeit useful, especially in
the penultimate layer and beyond, its action on the student's feature transform
is rather implicit, which limits its use in the intermediate layers. To
explicitly embed the teacher's knowledge in the feature transform, we propose a
learnable KD layer for the student which improves KD with two distinct
abilities: i) learning how to leverage the teacher's knowledge, enabling it to
discard nuisance information, and ii) feeding the transferred knowledge forward
to deeper layers. Thus, the student enjoys the teacher's knowledge during
inference as well as training. Formally, we repurpose a 1x1-BN-ReLU-1x1
convolution block to assign a semantic vector to each local region according to
the template (supervised by the teacher) that the corresponding region of the
student matches. To facilitate template learning in the intermediate layers, we
propose a novel form of supervision based on the teacher's decisions. Through
rigorous experimentation, we demonstrate the effectiveness of our approach on
three popular classification benchmarks. Code is available at:
https://github.com/adagorgun/letKD-framework
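The repurposed block described in the abstract can be pictured with a minimal PyTorch-style sketch. This only illustrates the 1x1-BN-ReLU-1x1 shape; the class name, channel counts, and placement inside the student are assumptions, and the teacher-supervised template learning and training losses are omitted (see the official repository above for the actual implementation).

```python
# Minimal sketch of the 1x1-BN-ReLU-1x1 block described in the abstract.
# Names and dimensions are illustrative; the teacher-supervised template
# learning and the associated losses are omitted.
import torch
import torch.nn as nn


class KDLayer(nn.Module):
    def __init__(self, in_channels: int, num_templates: int, out_channels: int):
        super().__init__()
        # First 1x1 conv scores each local region against learned templates.
        self.template_match = nn.Conv2d(in_channels, num_templates, kernel_size=1)
        self.bn = nn.BatchNorm2d(num_templates)
        self.relu = nn.ReLU(inplace=True)
        # Second 1x1 conv maps the matching scores to a semantic vector per
        # region, which is fed forward into the deeper layers of the student.
        self.embed = nn.Conv2d(num_templates, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.relu(self.bn(self.template_match(x)))
        return self.embed(scores)


# Usage sketch: insert between two intermediate stages of the student.
# layer = KDLayer(in_channels=256, num_templates=128, out_channels=256)
# out = layer(torch.randn(8, 256, 14, 14))
```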
Related papers
- Improving Knowledge Distillation with Teacher's Explanation [14.935696904019146]
We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
arXiv Detail & Related papers (2023-10-04T04:18:01Z)
- Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
arXiv Detail & Related papers (2023-06-26T12:54:28Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features (a generic sketch of these two losses follows this entry).
While it is natural to believe that better alignment of student features to the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
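For reference, the two alignment objectives mentioned in the entry above (KL-divergence between logits and L2 distance between intermediate features) can be sketched generically as follows; the temperature, weighting, and equal-shape assumption are illustrative, not details taken from the paper.

```python
# Generic sketch of the two standard KD alignment losses mentioned above:
# KL-divergence between softened logits and L2 distance between features.
# Temperature and the equal-shape assumption are illustrative placeholders.
import torch
import torch.nn.functional as F


def kd_losses(student_logits, teacher_logits, student_feat, teacher_feat, T=4.0):
    # Soft-label KL term, scaled by T^2 to keep gradient magnitudes comparable.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # L2 feature-matching term on intermediate features (same shape assumed).
    l2 = F.mse_loss(student_feat, teacher_feat)
    return kl, l2
```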
- Switchable Online Knowledge Distillation [68.2673580932132]
Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student.
We propose Switchable Online Knowledge Distillation (SwitOKD) to adaptively calibrate this gap during training.
arXiv Detail & Related papers (2022-09-12T03:03:40Z)
- Knowledge Condensation Distillation [38.446333274732126]
Existing methods focus on excavating knowledge hints and transferring the whole of the teacher's knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD).
Our approach is easy to build on top of the off-the-shelf KD methods, with no extra training parameters and negligible overhead.
arXiv Detail & Related papers (2022-07-12T09:17:34Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose Dynamic Prior Knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation (a rough sketch of this idea follows this entry).
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
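One plausible reading of the DPK summary above, sketched under stated assumptions: substitute the teacher's features at a random subset of spatial positions in the student's feature map before applying a feature-distillation loss. The mask ratio and the plain L2 loss are illustrative assumptions, not the paper's exact formulation.

```python
# Rough sketch of mixing teacher features into the student's feature map as a
# prior before feature distillation, as summarized above. The mask ratio and
# the plain L2 loss are illustrative assumptions.
import torch
import torch.nn.functional as F


def prior_feature_loss(student_feat, teacher_feat, prior_ratio=0.5):
    b, _, h, w = student_feat.shape
    # Boolean mask over spatial positions; True means "use the teacher's feature".
    mask = torch.rand(b, 1, h, w, device=student_feat.device) < prior_ratio
    mixed = torch.where(mask, teacher_feat, student_feat)
    # Distill the mixed map toward the (frozen) teacher features.
    return F.mse_loss(mixed, teacher_feat.detach())
```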
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student model that learns by imitating it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
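For intuition, a contrastive objective of the kind mentioned in the WCoRD summary above (an InfoNCE-style lower bound on the mutual information between teacher and student embeddings) can be sketched as follows; WCoRD's actual objective is derived from the dual Wasserstein form and differs in its details.

```python
# Generic InfoNCE-style contrastive sketch between teacher and student
# embeddings, i.e. a lower bound on their mutual information. This is an
# illustration, not WCoRD's dual-Wasserstein objective.
import torch
import torch.nn.functional as F


def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.1):
    # Matching (student_i, teacher_i) pairs are positives; all other pairs in
    # the batch act as negatives.
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    logits = s @ t.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```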
- Distilling Knowledge by Mimicking Features [32.79431807764681]
We argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer.
Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer.
arXiv Detail & Related papers (2020-11-03T02:15:14Z)
- Knowledge Transfer via Dense Cross-Layer Mutual-Distillation [24.24969126783315]
We propose Dense Cross-layer Mutual-distillation (DCM) in which the teacher and student networks are trained collaboratively from scratch.
To boost KT performance, we introduce dense bidirectional KD operations between the layers with appended classifiers.
We test our method on a variety of KT tasks, showing its superiority over related methods.
arXiv Detail & Related papers (2020-08-18T09:25:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.