Knowledge Transfer via Dense Cross-Layer Mutual-Distillation
- URL: http://arxiv.org/abs/2008.07816v1
- Date: Tue, 18 Aug 2020 09:25:08 GMT
- Title: Knowledge Transfer via Dense Cross-Layer Mutual-Distillation
- Authors: Anbang Yao, Dawei Sun
- Abstract summary: We propose Dense Cross-layer Mutual-distillation (DCM) in which the teacher and student networks are trained collaboratively from scratch.
To boost KT performance, we introduce dense bidirectional KD operations between the layers with appended classifiers.
We test our method on a variety of KT tasks, showing its superiority over related methods.
- Score: 24.24969126783315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) based methods adopt the one-way Knowledge
Transfer (KT) scheme in which training a lower-capacity student network is
guided by a pre-trained high-capacity teacher network. Recently, Deep Mutual
Learning (DML) presented a two-way KT strategy, showing that the student
network can be also helpful to improve the teacher network. In this paper, we
propose Dense Cross-layer Mutual-distillation (DCM), an improved two-way KT
method in which the teacher and student networks are trained collaboratively
from scratch. To augment knowledge representation learning, well-designed
auxiliary classifiers are added to certain hidden layers of both teacher and
student networks. To boost KT performance, we introduce dense bidirectional KD
operations between the layers to which classifiers are appended. After training, all
auxiliary classifiers are discarded, and thus there are no extra parameters
introduced to final models. We test our method on a variety of KT tasks,
showing its superiority over related methods. Code is available at
https://github.com/sundw2014/DCM
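The dense bidirectional KD idea can be sketched as follows: each network carries auxiliary classifiers on its hidden layers, and every classifier of one network is distilled toward every classifier of the other, in both directions, while both networks also learn from the labels. The toy architecture, layer widths, and temperature below are illustrative assumptions, not the paper's actual configuration (see the linked repository for the authors' code).

```python
# Minimal sketch of dense bidirectional KD between auxiliary classifiers,
# in the spirit of DCM. Architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetWithAuxHeads(nn.Module):
    """Toy MLP exposing logits from auxiliary classifiers attached to
    hidden layers, plus the final classifier."""
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.aux1 = nn.Linear(hidden, num_classes)  # auxiliary head, discarded after training
        self.aux2 = nn.Linear(hidden, num_classes)  # auxiliary head, discarded after training
        self.head = nn.Linear(hidden, num_classes)  # final classifier, kept

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        # Return logits from every classifier, shallow to deep.
        return [self.aux1(h1), self.aux2(h2), self.head(h2)]

def dense_kd_loss(logits_a, logits_b, T=3.0):
    """Sum of KL divergences pushing every classifier of network A toward
    every (detached) classifier of network B -- the 'dense' part."""
    loss = 0.0
    for la in logits_a:
        for lb in logits_b:
            target = F.softmax(lb.detach() / T, dim=1)  # other net acts as teacher
            loss = loss + F.kl_div(F.log_softmax(la / T, dim=1), target,
                                   reduction="batchmean") * T * T
    return loss

# One collaborative training step: both networks are trained from scratch,
# learning from the labels and from each other's layer-wise predictions.
net_a, net_b = NetWithAuxHeads(), NetWithAuxHeads()
opt = torch.optim.SGD(list(net_a.parameters()) + list(net_b.parameters()), lr=0.1)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

logits_a, logits_b = net_a(x), net_b(x)
ce = sum(F.cross_entropy(l, y) for l in logits_a + logits_b)
kd = dense_kd_loss(logits_a, logits_b) + dense_kd_loss(logits_b, logits_a)  # bidirectional
loss = ce + kd
opt.zero_grad(); loss.backward(); opt.step()
```

Because the auxiliary heads exist only to shape the loss, dropping `aux1` and `aux2` after training leaves the deployed model with no extra parameters, matching the abstract's claim.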
Related papers
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Adaptive Teaching with Shared Classifier for Knowledge Distillation [6.03477652126575]
Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC).
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios.
arXiv Detail & Related papers (2024-06-12T08:51:08Z)
- Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition [27.326420185846327]
We present a Mutual Contrastive Learning (MCL) framework for online Knowledge Distillation (KD).
Our MCL can aggregate cross-network embedding information and maximize a lower bound on the mutual information between the two networks.
Experiments on image classification and transfer learning to visual recognition tasks show that layer-wise MCL can lead to consistent performance gains.
arXiv Detail & Related papers (2022-07-23T13:39:01Z)
- Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z)
- Hierarchical Self-supervised Augmented Knowledge Distillation [1.9355744690301404]
We propose an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and self-supervised auxiliary task.
This joint distribution is shown to provide richer knowledge that improves representation power without sacrificing normal classification capability.
Our method significantly surpasses the previous SOTA SSKD with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet.
arXiv Detail & Related papers (2021-07-29T02:57:21Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of connection path cross levels between teacher and student networks, and reveal its great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our finally designed nested and compact framework requires negligible overhead, and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method overcomes the limitations of existing approaches by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
- Inter-Region Affinity Distillation for Road Marking Segmentation [81.3619453527367]
We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network.
Our method is known as Inter-Region Affinity KD (IntRA-KD).
arXiv Detail & Related papers (2020-04-11T04:26:37Z)
- Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least a 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.