Knowledge Distillation Meets Self-Supervision
- URL: http://arxiv.org/abs/2006.07114v2
- Date: Mon, 13 Jul 2020 09:14:27 GMT
- Title: Knowledge Distillation Meets Self-Supervision
- Authors: Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy
- Abstract summary: Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
- Score: 109.6400639148393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation, which involves extracting the "dark knowledge" from a
teacher network to guide the learning of a student network, has emerged as an
important technique for model compression and transfer learning. Unlike
previous works that exploit architecture-specific cues such as activation and
attention for distillation, here we wish to explore a more general and
model-agnostic approach for extracting "richer dark knowledge" from the
pre-trained teacher model. We show that the seemingly different
self-supervision task can serve as a simple yet powerful solution. For example,
when performing contrastive learning between transformed entities, the noisy
predictions of the teacher network reflect its intrinsic composition of
semantic and pose information. By exploiting the similarity between those
self-supervision signals as an auxiliary task, one can effectively transfer the
hidden information from the teacher to the student. In this paper, we discuss
practical ways to exploit those noisy self-supervision signals with selective
transfer for distillation. We further show that self-supervision signals
improve conventional distillation with substantial gains under few-shot and
noisy-label scenarios. Given the richer knowledge mined from self-supervision,
our knowledge distillation approach achieves state-of-the-art performance on
standard benchmarks, i.e., CIFAR100 and ImageNet, under both
similar-architecture and cross-architecture settings. The advantage is even
more pronounced under the cross-architecture setting, where our method
outperforms the state-of-the-art CRD by an average of 2.3% in accuracy on
CIFAR100 across six different teacher-student pairs.
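As a rough illustration of the mechanism described in the abstract, the sketch below pairs a conventional soft-label distillation term with a term that asks the student to match the teacher's similarity structure over contrastively transformed inputs. It is a minimal sketch under stated assumptions, not the authors' released implementation: the (logits, embedding) interfaces, temperatures, and loss weights are illustrative placeholders, and the paper's selective transfer of noisy self-supervision signals is omitted.

```python
# Minimal sketch (not the authors' code) of combining conventional KD with a
# self-supervision similarity-transfer term. Interfaces, temperatures, and loss
# weights below are illustrative assumptions.
import torch
import torch.nn.functional as F


def similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between embeddings of a batch that mixes
    original images with their contrastively transformed copies."""
    z = F.normalize(embeddings, dim=1)
    return z @ z.t()


def sskd_style_loss(student_out, teacher_out, labels,
                    tau_kd=4.0, tau_ss=0.5, w_ce=1.0, w_kd=0.9, w_ss=2.7):
    """Cross-entropy + soft-label KD + self-supervision similarity transfer.

    student_out / teacher_out: (logits, embedding) pairs; teacher outputs are
    assumed to be computed under torch.no_grad() so only the student trains.
    """
    s_logits, s_emb = student_out
    t_logits, t_emb = teacher_out

    # 1) Supervised loss on the student's class predictions.
    ce = F.cross_entropy(s_logits, labels)

    # 2) Conventional KD: match the teacher's softened class distribution.
    kd = F.kl_div(F.log_softmax(s_logits / tau_kd, dim=1),
                  F.softmax(t_logits / tau_kd, dim=1),
                  reduction="batchmean") * tau_kd ** 2

    # 3) Self-supervision transfer: the student mimics the teacher's similarity
    #    structure between transformed entities. The paper's selective transfer,
    #    which filters the teacher's noisiest predictions, is omitted here.
    t_sim = F.softmax(similarity_matrix(t_emb) / tau_ss, dim=1)
    s_sim = F.log_softmax(similarity_matrix(s_emb) / tau_ss, dim=1)
    ss = F.kl_div(s_sim, t_sim, reduction="batchmean") * tau_ss ** 2

    return w_ce * ce + w_kd * kd + w_ss * ss
```

In this reading, the similarity term carries the "richer dark knowledge": it transfers how the teacher relates transformed entities to one another rather than only its per-image class scores.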
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM).
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z)
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- Knowledge Distillation via Token-level Relationship Graph [12.356770685214498]
We propose a novel method called Knowledge Distillation with Token-level Relationship Graph (TRG).
By employing TRG, the student model can effectively emulate higher-level semantic information from the teacher model.
We conduct experiments to evaluate the effectiveness of the proposed method against several state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-20T08:16:37Z)
- Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
arXiv Detail & Related papers (2022-11-23T19:27:48Z)
- Learning Knowledge Representation with Meta Knowledge Distillation for Single Image Super-Resolution [82.89021683451432]
We propose a model-agnostic meta knowledge distillation method under the teacher-student architecture for the single image super-resolution task.
Experiments conducted on various single image super-resolution datasets demonstrate that the proposed method outperforms existing distillation methods that rely on pre-defined knowledge representations.
arXiv Detail & Related papers (2022-07-18T02:41:04Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Hierarchical Self-supervised Augmented Knowledge Distillation [1.9355744690301404]
We propose an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and the self-supervised auxiliary task (sketched roughly after this list).
This joint task is shown to provide richer knowledge that improves representational power without hurting normal classification capability.
Our method significantly surpasses the previous SOTA SSKD with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet.
arXiv Detail & Related papers (2021-07-29T02:57:21Z)
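The joint-task construction referenced in the last entry can be sketched as follows. This is a minimal, assumption-laden sketch rather than the authors' implementation: it assumes image rotation (0/90/180/270 degrees) as the self-supervised transformation and treats each (class, rotation) pair as one label in a C x 4 joint space; the paper's hierarchical auxiliary classifiers over intermediate features are omitted, and the helper names are hypothetical.

```python
# Minimal sketch (assumptions, not the authors' code): rotation is used as the
# self-supervised transformation and every (class, rotation) pair becomes one
# target in a C*4 joint label space that teacher and student both predict.
import torch
import torch.nn.functional as F


def rotate_batch(images: torch.Tensor):
    """Return the NCHW batch under 0/90/180/270 degree rotations and the rotation ids."""
    views, rot_ids = [], []
    for k in range(4):
        views.append(torch.rot90(images, k, dims=(2, 3)))
        rot_ids.append(torch.full((images.size(0),), k,
                                  dtype=torch.long, device=images.device))
    return torch.cat(views), torch.cat(rot_ids)


def joint_targets(class_labels: torch.Tensor, rot_ids: torch.Tensor,
                  num_classes: int) -> torch.Tensor:
    """Map each (class, rotation) pair to a single index in the C*4 joint space."""
    return rot_ids * num_classes + class_labels.repeat(4)


def joint_task_kd_loss(student_joint_logits, teacher_joint_logits, targets, tau=4.0):
    """Supervise the student on the joint task and distill the teacher's joint distribution."""
    ce = F.cross_entropy(student_joint_logits, targets)
    kd = F.kl_div(F.log_softmax(student_joint_logits / tau, dim=1),
                  F.softmax(teacher_joint_logits / tau, dim=1),
                  reduction="batchmean") * tau ** 2
    return ce + kd
```

Distilling the teacher's distribution over this larger joint label space, rather than only its C-way class distribution, is what the entry above describes as richer knowledge.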
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.