Revisiting Knowledge Distillation: An Inheritance and Exploration
Framework
- URL: http://arxiv.org/abs/2107.00181v1
- Date: Thu, 1 Jul 2021 02:20:56 GMT
- Title: Revisiting Knowledge Distillation: An Inheritance and Exploration
Framework
- Authors: Zhen Huang, Xu Shen, Jun Xing, Tongliang Liu, Xinmei Tian, Houqiang
Li, Bing Deng, Jianqiang Huang and Xian-Sheng Hua
- Abstract summary: Knowledge Distillation (KD) is a popular technique to transfer knowledge from a teacher model to a student model.
We propose a novel inheritance and exploration knowledge distillation framework (IE-KD).
Our IE-KD framework is generic and can be easily combined with existing distillation or mutual learning methods for training deep neural networks.
- Score: 153.73692961660964
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge Distillation (KD) is a popular technique to transfer knowledge from
a teacher model or ensemble to a student model. Its success is generally
attributed to the privileged information on similarities/consistency between
the class distributions or intermediate feature representations of the teacher
model and the student model. However, directly pushing the student model to
mimic the probabilities/features of the teacher model to a large extent limits
the student model in learning undiscovered knowledge/features. In this paper,
we propose a novel inheritance and exploration knowledge distillation framework
(IE-KD), in which a student model is split into two parts - inheritance and
exploration. The inheritance part is learned with a similarity loss to transfer
the existing learned knowledge from the teacher model to the student model,
while the exploration part is encouraged to learn representations different
from the inherited ones with a dis-similarity loss. Our IE-KD framework is
generic and can be easily combined with existing distillation or mutual
learning methods for training deep neural networks. Extensive experiments
demonstrate that these two parts can jointly push the student model to learn
more diversified and effective representations, and our IE-KD can be a general
technique to improve the student network to achieve SOTA performance.
Furthermore, by applying our IE-KD to the training of two networks, the
performance of both can be improved w.r.t. deep mutual learning. The code and
models of IE-KD will be made publicly available at
https://github.com/yellowtownhz/IE-KD.
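The inheritance/exploration split described in the abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the loss formulas, so the code assumes a mean squared distance between L2-normalized feature vectors as the similarity loss and its negation as the dis-similarity loss. It also assumes the teacher feature has already been projected to the size of each student half. The names `ie_kd_feature_losses`, `ie_kd_total_loss`, `w_inh` and `w_exp` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def normalized_mse(a, b):
    """Mean squared distance between the L2-normalized versions of a and b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    a, b = l2_normalize(a), l2_normalize(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ie_kd_feature_losses(student_feat, teacher_feat):
    """Split the student feature in two halves and score each against the teacher.

    Assumption: the first half is the inheritance part, the second half is the
    exploration part, and the teacher feature has the size of one half.
    """
    if len(student_feat) % 2 != 0:
        raise ValueError("student feature length must be even")
    half = len(student_feat) // 2
    inherit, explore = student_feat[:half], student_feat[half:]
    # Inheritance: pull this half towards the teacher feature (similarity loss).
    loss_inh = normalized_mse(inherit, teacher_feat)
    # Exploration: push this half away from the teacher feature. Negating the
    # distance is an assumed stand-in for the paper's dis-similarity loss.
    loss_exp = -normalized_mse(explore, teacher_feat)
    return loss_inh, loss_exp


def ie_kd_total_loss(task_loss, student_feat, teacher_feat, w_inh=1.0, w_exp=1.0):
    """Combine the task loss with both feature losses (weights are hypothetical)."""
    loss_inh, loss_exp = ie_kd_feature_losses(student_feat, teacher_feat)
    return task_loss + w_inh * loss_inh + w_exp * loss_exp


if __name__ == "__main__":
    student = [1.0, 0.0, 0.0, 1.0]  # toy 4-d student feature
    teacher = [1.0, 0.0]            # toy 2-d (projected) teacher feature
    print(ie_kd_feature_losses(student, teacher))
    print(ie_kd_total_loss(0.5, student, teacher))
```

In this toy call the inheritance half equals the teacher feature, so its loss is zero, while the exploration half is orthogonal to it and is rewarded with a negative loss. In practice both terms would be computed on batches of network activations with an automatic differentiation library and added to the usual classification or distillation loss.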
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Leveraging Different Learning Styles for Improved Knowledge Distillation
in Biomedical Imaging [0.9208007322096533]
Our work endeavors to leverage the concept of knowledge diversification to improve the performance of model compression techniques like Knowledge Distillation (KD) and Mutual Learning (ML).
We use a single-teacher and two-student network in a unified framework that not only allows for the transfer of knowledge from teacher to students (KD) but also encourages collaborative learning between students (ML).
Unlike the conventional approach, where the teacher shares the same knowledge in the form of predictions or feature representations with the student network, our proposed approach employs a more diversified strategy by training one student with predictions and the other with feature maps from the teacher.
arXiv Detail & Related papers (2022-12-06T12:40:45Z)
- Extracting knowledge from features with multilevel abstraction [3.4443503349903124]
Self-knowledge distillation (SKD) aims at transferring the knowledge from a large teacher model to a small student model.
In this paper, we propose a novel SKD method that differs from the mainstream methods.
Experiments and ablation studies show its great effectiveness and generalization on various kinds of tasks.
arXiv Detail & Related papers (2021-12-04T02:25:46Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Multi-level Knowledge Distillation [13.71183256776644]
We introduce Multi-level Knowledge Distillation (MLKD) to transfer richer representational knowledge from teacher to student networks.
MLKD employs three novel teacher-student similarities: individual similarity, relational similarity, and categorical similarity.
Experiments demonstrate that MLKD outperforms other state-of-the-art methods on both similar-architecture and cross-architecture tasks.
arXiv Detail & Related papers (2020-12-01T15:27:15Z)
- Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher).
In this study, we provide an extensive study of nine different KD methods that cover a broad spectrum of approaches to capture and transfer knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)