Knowledge Distillation Using Hierarchical Self-Supervision Augmented
Distribution
- URL: http://arxiv.org/abs/2109.03075v1
- Date: Tue, 7 Sep 2021 13:29:32 GMT
- Title: Knowledge Distillation Using Hierarchical Self-Supervision Augmented
Distribution
- Authors: Chuanguang Yang, Zhulin An, Linhang Cai, and Yongjun Xu
- Abstract summary: We propose an auxiliary self-supervision augmented task to guide networks to learn more meaningful features.
Unlike previous knowledge, this distribution encodes joint knowledge from supervised and self-supervised feature learning.
We call our KD method Hierarchical Self-Supervision Augmented Knowledge Distillation (HSSAKD).
- Score: 1.7718093866806544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is an effective framework that aims to transfer
meaningful information from a large teacher to a smaller student. In general, KD
hinges on how knowledge is defined and transferred. Previous KD methods often
focus on mining various forms of knowledge, for example, feature maps and
refined information. However, the knowledge is derived from the primary
supervised task and thus is highly task-specific. Motivated by the recent
success of self-supervised representation learning, we propose an auxiliary
self-supervision augmented task to guide networks to learn more meaningful
features. Therefore, we can derive soft self-supervision augmented
distributions as richer dark knowledge from this task for KD. Unlike previous
knowledge, this distribution encodes joint knowledge from supervised and
self-supervised feature learning. Beyond knowledge exploration, another crucial
aspect is how to learn and distill our proposed knowledge effectively. To fully
take advantage of hierarchical feature maps, we propose to append several
auxiliary branches at various hidden layers. Each auxiliary branch is guided to
learn the self-supervision augmented task and to distill this distribution from
teacher to student. We therefore call our KD method Hierarchical Self-Supervision
Augmented Knowledge Distillation (HSSAKD). Experiments on standard image
classification show that both offline and online HSSAKD achieve
state-of-the-art performance in the field of KD. Transfer experiments
on object detection further verify that HSSAKD can guide the network to learn
better features, which can be attributed to effectively learning and distilling
an auxiliary self-supervision augmented task.
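The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: the self-supervised task is taken to be a choice among a fixed number of input transformations (rotations are a common choice), the joint label space is the product of classes and transformations, and each auxiliary branch contributes a temperature-scaled KL term with the conventional T^2 factor. All function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def joint_label(class_idx, transform_idx, num_transforms):
    """Map a (class, transformation) pair to one index in the joint label space.

    Assumption: the joint space has num_classes * num_transforms entries.
    """
    return class_idx * num_transforms + transform_idx


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hierarchical_kd_loss(teacher_branch_logits, student_branch_logits, temperature=3.0):
    """Sum KL terms over paired auxiliary branches (one pair per hidden stage).

    Assumption: the T^2 scaling follows the usual soft-label KD convention.
    """
    loss = 0.0
    for t_logits, s_logits in zip(teacher_branch_logits, student_branch_logits):
        p = softmax(t_logits, temperature)  # teacher's soft joint distribution
        q = softmax(s_logits, temperature)  # student's soft joint distribution
        loss += (temperature ** 2) * kl_divergence(p, q)
    return loss


# Toy usage: 3 classes x 4 transformations gives a 12-way joint label space.
NUM_TRANSFORMS = 4
label = joint_label(class_idx=2, transform_idx=1, num_transforms=NUM_TRANSFORMS)
teacher = [[0.1 * i for i in range(12)], [0.2 * i for i in range(12)]]
student = [[0.0] * 12, [0.0] * 12]
loss = hierarchical_kd_loss(teacher, student)
```

In this toy setup each branch outputs one logit per (class, transformation) pair, so the distilled distribution carries both supervised and self-supervised information, which is the property the abstract emphasizes.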
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT), achieves improved performance compared to state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Condensation Distillation [38.446333274732126]
Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD).
Our approach is easy to build on top of the off-the-shelf KD methods, with no extra training parameters and negligible overhead.
arXiv Detail & Related papers (2022-07-12T09:17:34Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Hierarchical Self-supervised Augmented Knowledge Distillation [1.9355744690301404]
We propose an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and self-supervised auxiliary task.
This joint distribution is shown to be richer knowledge that improves representation power without losing normal classification capability.
Our method significantly surpasses the previous SOTA SSKD with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet.
arXiv Detail & Related papers (2021-07-29T02:57:21Z) - Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z) - KDExplainer: A Task-oriented Attention Model for Explaining Knowledge
Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z) - Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge
Distillation [12.097302014936655]
This paper proposes a novel self-knowledge distillation method, Feature Refinement via Self-Knowledge Distillation (FRSKD).
Our proposed method, FRSKD, can utilize both soft label and feature-map distillations for the self-knowledge distillation.
We demonstrate the effectiveness of FRSKD by enumerating its performance improvements in diverse tasks and benchmark datasets.
arXiv Detail & Related papers (2021-03-15T10:59:43Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)