A Closer Look at Knowledge Distillation with Features, Logits, and
Gradients
- URL: http://arxiv.org/abs/2203.10163v1
- Date: Fri, 18 Mar 2022 21:26:55 GMT
- Title: A Closer Look at Knowledge Distillation with Features, Logits, and
Gradients
- Authors: Yen-Chang Hsu, James Smith, Yilin Shen, Zsolt Kira, Hongxia Jin
- Abstract summary: Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
- Score: 81.39206923719455
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge distillation (KD) is a substantial strategy for transferring
learned knowledge from one neural network model to another. A vast number of
methods have been developed for this strategy. While most methods design a more
efficient way to facilitate knowledge transfer, less attention has been paid to
comparing the effect of knowledge sources such as features, logits, and
gradients. This work provides a new perspective to motivate a set of knowledge
distillation strategies by approximating the classical KL-divergence criteria
with different knowledge sources, making a systematic comparison possible in
model compression and incremental learning. Our analysis indicates that logits
are generally a more efficient knowledge source and suggests that having
sufficient feature dimensions is crucial for the model design, providing a
practical guideline for effective KD-based transfer learning.
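To make the comparison of knowledge sources concrete, the sketch below shows the two most common distillation losses referenced above: a temperature-softened KL divergence on logits and a mean-squared error on intermediate features. It is a minimal PyTorch-style illustration rather than the authors' implementation; the helper names, temperature, loss weights, and tensor shapes are assumptions made only for this example.

```python
# Minimal sketch of two common KD knowledge sources discussed above:
# (1) logits matched with a temperature-softened KL divergence, and
# (2) intermediate features matched with an MSE loss.
# Hyperparameters (temperature, loss weights) and tensor shapes are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def feature_kd_loss(student_feat, teacher_feat):
    """Mean-squared error between dimension-matched intermediate features."""
    return F.mse_loss(student_feat, teacher_feat)

# Example usage with random tensors standing in for real model outputs.
batch, num_classes, feat_dim = 8, 100, 512
s_logits, t_logits = torch.randn(batch, num_classes), torch.randn(batch, num_classes)
s_feat, t_feat = torch.randn(batch, feat_dim), torch.randn(batch, feat_dim)
labels = torch.randint(0, num_classes, (batch,))

total_loss = (F.cross_entropy(s_logits, labels)         # task loss
              + 1.0 * logit_kd_loss(s_logits, t_logits)  # logit knowledge
              + 0.5 * feature_kd_loss(s_feat, t_feat))   # feature knowledge
```

The logit term passes on the teacher's full class distribution, while the feature term presumes the student's feature dimension matches (or is projected to) the teacher's, which is where the abstract's point about having sufficient feature dimensions enters the model design.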
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed adaptive explicit knowledge transfer (AEKT) method achieves improved performance compared to state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z)
- Hint-dynamic Knowledge Distillation [30.40008256306688]
Hint-dynamic Knowledge Distillation, dubbed HKD, excavates the knowledge from the teacher's hints in a dynamic scheme.
A meta-weight network is introduced to generate instance-wise weight coefficients for the knowledge hints.
Experiments on the standard CIFAR-100 and Tiny-ImageNet benchmarks show that the proposed HKD effectively boosts knowledge distillation.
arXiv Detail & Related papers (2022-11-30T15:03:53Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z)
- On the Orthogonality of Knowledge Distillation with Other Techniques: From an Ensemble Perspective [34.494730096460636]
We show that knowledge distillation is a powerful apparatus for the practical deployment of efficient neural networks.
We also introduce ways to integrate knowledge distillation with other methods effectively.
arXiv Detail & Related papers (2020-09-09T06:14:59Z)
- Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher).
This work provides an extensive study of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them; a minimal sketch follows this entry.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
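Based only on the RKD summary above, here is a minimal, hypothetical sketch of the residual idea: the student mimics the teacher's features while a small assistant module is fit to the remaining error. The module types, dimensions, helper names, and loss weights are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch of the Residual Knowledge Distillation idea summarized above:
# the student S mimics the teacher T's feature map, and an assistant A is
# trained on the residual between them. Module sizes and loss weights are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 512
student_backbone = nn.Linear(64, feat_dim)   # stand-in for the student S
assistant = nn.Linear(64, feat_dim)          # stand-in for the assistant A

def rkd_losses(x, teacher_feat):
    s_feat = student_backbone(x)
    a_feat = assistant(x)
    # S is trained to mimic T's features directly.
    mimic_loss = F.mse_loss(s_feat, teacher_feat)
    # A is trained to predict the residual error T - S left by the student.
    residual_loss = F.mse_loss(a_feat, (teacher_feat - s_feat).detach())
    return mimic_loss, residual_loss

x = torch.randn(8, 64)
teacher_feat = torch.randn(8, feat_dim)      # stand-in for frozen teacher features
mimic_loss, residual_loss = rkd_losses(x, teacher_feat)
total = mimic_loss + residual_loss
```

Here the residual target is detached so the assistant only models the error the student leaves behind; this particular choice is assumed for the sketch rather than taken from the paper.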