SimReg: Regression as a Simple Yet Effective Tool for Self-supervised
Knowledge Distillation
- URL: http://arxiv.org/abs/2201.05131v1
- Date: Thu, 13 Jan 2022 18:41:46 GMT
- Title: SimReg: Regression as a Simple Yet Effective Tool for Self-supervised
Knowledge Distillation
- Authors: K L Navaneet, Soroush Abbasi Koohpayegani, Ajinkya Tejankar, Hamed
Pirsiavash
- Abstract summary: Feature regression is a simple way to distill large neural network models to smaller ones.
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
- Score: 14.739041141948032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature regression is a simple way to distill large neural network models to
smaller ones. We show that with simple changes to the network architecture,
regression can outperform more complex state-of-the-art approaches for
knowledge distillation from self-supervised models. Surprisingly, the addition
of a multi-layer perceptron head to the CNN backbone is beneficial even if used
only during distillation and discarded in the downstream task. Deeper
non-linear projections can thus be used to accurately mimic the teacher without
changing inference architecture and time. Moreover, we utilize independent
projection heads to simultaneously distill multiple teacher networks. We also
find that using the same weakly augmented image as input for both teacher and
student networks aids distillation. Experiments on the ImageNet dataset demonstrate
the efficacy of the proposed changes in various self-supervised distillation
settings.
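To make the recipe above concrete, here is a minimal PyTorch-style sketch of feature regression with a throwaway MLP head, assuming a frozen ResNet-50 teacher and a ResNet-18 student; the head depth, layer widths, optimizer, and MSE loss are illustrative choices, not the exact configuration from the paper.

```python
# Hedged sketch: feature-regression distillation with an MLP head that is
# used only during distillation and discarded for the downstream task.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Frozen self-supervised teacher (weights here are random placeholders).
teacher = torchvision.models.resnet50(weights=None)
teacher.fc = nn.Identity()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student backbone plus a deeper non-linear projection head (illustrative sizes).
student = torchvision.models.resnet18(weights=None)
student.fc = nn.Identity()
head = nn.Sequential(
    nn.Linear(512, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048),  # project to the teacher's feature dimension
)

opt = torch.optim.SGD(list(student.parameters()) + list(head.parameters()),
                      lr=0.05, momentum=0.9, weight_decay=1e-4)

def distill_step(images):
    """One step: the SAME weakly augmented batch is fed to both networks."""
    with torch.no_grad():
        t_feat = F.normalize(teacher(images), dim=1)
    s_feat = F.normalize(head(student(images)), dim=1)
    loss = F.mse_loss(s_feat, t_feat)  # plain feature regression
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for weakly augmented ImageNet images.
distill_step(torch.randn(8, 3, 224, 224))
# After distillation, keep `student` alone for downstream tasks and drop `head`.
```

For distilling several teachers at once, the same pattern would simply use one independent head per teacher, each regressing onto that teacher's features.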
Related papers
- Understanding the Gains from Repeated Self-Distillation [65.53673000292079]
Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
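As a rough illustration of the multi-step idea on a regression task, here is a hedged sketch using ridge regression on synthetic data; the model class, regularization strength, and data are assumptions for illustration, not the paper's experimental setup.

```python
# Hedged sketch: multi-step self-distillation for regression.
# Step 0 trains on the labels; each later step refits on the previous model's predictions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
X = rng.normal(size=(200, 10))
y = X @ w_true + 0.5 * rng.normal(size=200)         # noisy training targets
X_test = rng.normal(size=(1000, 10))
y_test = X_test @ w_true                            # noiseless test targets

def self_distill(X, y, steps, alpha=1.0):
    """Refit the same model class on its own predictions `steps` times."""
    targets = y
    for _ in range(steps + 1):
        model = Ridge(alpha=alpha).fit(X, targets)
        targets = model.predict(X)                  # the next student regresses onto these
    return model

for steps in (0, 1, 3):
    mse = mean_squared_error(y_test, self_distill(X, y, steps).predict(X_test))
    print(f"{steps} self-distillation step(s): test MSE = {mse:.4f}")
```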
arXiv Detail & Related papers (2024-07-05T15:48:34Z)
- Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
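A minimal sketch of the instance-relation idea, assuming a dense cosine-similarity graph over each batch and a simple MSE alignment loss; the paper's actual graph construction and objective may differ.

```python
# Hedged sketch: align the instance-instance relation graphs of teacher and
# student features within a batch.
import torch
import torch.nn.functional as F

def relation_graph(features):
    """Pairwise cosine-similarity matrix over the batch (a dense instance graph)."""
    z = F.normalize(features, dim=1)
    return z @ z.t()

def graph_alignment_loss(student_feats, teacher_feats):
    """Penalize disagreement between the student and (frozen) teacher graphs."""
    return F.mse_loss(relation_graph(student_feats),
                      relation_graph(teacher_feats).detach())

# Toy usage: 16 images, student features of dim 512, teacher features of dim 2048.
loss = graph_alignment_loss(torch.randn(16, 512), torch.randn(16, 2048))
```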
arXiv Detail & Related papers (2022-11-23T19:27:48Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models but struggles with small models.
We introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Visualizing the embedding space to explain the effect of knowledge distillation [5.678337324555035]
Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does 'better'.
arXiv Detail & Related papers (2021-10-09T07:04:26Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
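For intuition, a hedged sketch of predictive churn between two classifiers, alongside a standard distillation loss that uses the base model as the teacher; the temperature, mixing weight, and KL form are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: predictive churn, and distillation toward a frozen base model.
import torch
import torch.nn.functional as F

def churn(logits_a, logits_b):
    """Fraction of examples on which the two models' predicted classes differ."""
    return (logits_a.argmax(dim=1) != logits_b.argmax(dim=1)).float().mean()

def distill_loss(student_logits, base_logits, labels, alpha=0.5, temperature=2.0):
    """Cross-entropy on labels mixed with a temperature-scaled KL to the base model."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  F.softmax(base_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    return (1 - alpha) * ce + alpha * kl

# Toy usage on random logits for a 10-class problem.
s, b = torch.randn(32, 10), torch.randn(32, 10)
print(churn(s, b), distill_loss(s, b, torch.randint(0, 10, (32,))))
```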
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- Students are the Best Teacher: Exit-Ensemble Distillation with Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models.
Our solution performs distillation by only driving the student model's predictions to be consistent with those of the teacher model.
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
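A minimal sketch of that prediction-only distillation, under these assumptions: a ResNet-50 stands in for the big pre-trained teacher, a MobileNetV3-large for the student, and a temperature-scaled KL loss matches their predictions; the paper's exact loss and training details may differ.

```python
# Hedged sketch: distillation driven only by matching the student's predictions
# to the teacher's predictions (no ground-truth labels required).
import torch
import torch.nn.functional as F
import torchvision

teacher = torchvision.models.resnet50(weights=None).eval()   # stand-in for a big model
student = torchvision.models.mobilenet_v3_large(weights=None)

def prediction_matching_loss(images, temperature=1.0):
    with torch.no_grad():
        t_prob = F.softmax(teacher(images) / temperature, dim=1)
    s_logp = F.log_softmax(student(images) / temperature, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

loss = prediction_matching_loss(torch.randn(4, 3, 224, 224))
loss.backward()   # gradients reach only the student
```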
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
- Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z)
- RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr [60.07531696857743]
Fine-tuning a deep convolutional neural network (CNN) starting from a pre-trained model helps transfer knowledge learned from larger datasets to the target task.
We propose RIFLE - a strategy that deepens backpropagation in transfer learning settings.
RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning.
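A hedged sketch of the re-initialization idea, assuming a ResNet-18 backbone, a fresh linear head for a 10-class target task, and a tiny illustrative schedule; none of these choices are taken from the paper.

```python
# Hedged sketch of RIFLE-style fine-tuning: periodically re-initialize the final
# fully-connected layer so deeper CNN layers keep receiving strong gradients.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights=None)   # imagine ImageNet-pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)      # new head for the target task
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

num_epochs, reinit_every = 6, 2                     # toy schedule for illustration
for epoch in range(num_epochs):
    if epoch > 0 and epoch % reinit_every == 0:
        model.fc.reset_parameters()                 # the RIFLE step
    # One (toy) fine-tuning step per epoch; real training would loop over a dataloader.
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```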
arXiv Detail & Related papers (2020-07-07T11:27:43Z)
- Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
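A rough sketch of how mixup and active learning could be combined for blackbox distillation: synthesize a pool of mixup images and spend the query budget on the ones the student is least confident about; the Beta parameter, confidence-based selection rule, and budget are assumptions for illustration.

```python
# Hedged sketch: active mixup for data-efficient blackbox distillation.
import torch
import torch.nn.functional as F

def mixup_pool(images, n_mix, alpha=1.0):
    """Create n_mix convex combinations of randomly paired images."""
    idx_a = torch.randint(0, len(images), (n_mix,))
    idx_b = torch.randint(0, len(images), (n_mix,))
    lam = torch.distributions.Beta(alpha, alpha).sample((n_mix, 1, 1, 1))
    return lam * images[idx_a] + (1 - lam) * images[idx_b]

def select_uncertain(student, pool, budget):
    """Active step: keep the mixup images with the least confident student predictions."""
    with torch.no_grad():
        conf = F.softmax(student(pool), dim=1).max(dim=1).values
    return pool[conf.argsort()[:budget]]

# Usage outline (the blackbox teacher is only a prediction API):
#   pool    = mixup_pool(unlabeled_images, n_mix=10_000)
#   queries = select_uncertain(student, pool, budget=1_000)
#   soft_labels = blackbox_teacher(queries)   # the only teacher queries made
#   ...train the student on (queries, soft_labels), then repeat the loop.
```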
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.