SimReg: Regression as a Simple Yet Effective Tool for Self-supervised
Knowledge Distillation
- URL: http://arxiv.org/abs/2201.05131v1
- Date: Thu, 13 Jan 2022 18:41:46 GMT
- Title: SimReg: Regression as a Simple Yet Effective Tool for Self-supervised
Knowledge Distillation
- Authors: K L Navaneet, Soroush Abbasi Koohpayegani, Ajinkya Tejankar, Hamed
Pirsiavash
- Abstract summary: Feature regression is a simple way to distill large neural network models to smaller ones.
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
- Score: 14.739041141948032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature regression is a simple way to distill large neural network models to
smaller ones. We show that with simple changes to the network architecture,
regression can outperform more complex state-of-the-art approaches for
knowledge distillation from self-supervised models. Surprisingly, the addition
of a multi-layer perceptron head to the CNN backbone is beneficial even if used
only during distillation and discarded in the downstream task. Deeper
non-linear projections can thus be used to accurately mimic the teacher without
changing inference architecture and time. Moreover, we utilize independent
projection heads to simultaneously distill multiple teacher networks. We also
find that using the same weakly augmented image as input for both teacher and
student networks aids distillation. Experiments on the ImageNet dataset demonstrate
the efficacy of the proposed changes in various self-supervised distillation
settings.
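To make the recipe above concrete, here is a minimal PyTorch-style sketch of feature regression with a throwaway MLP head, assuming a frozen ResNet-50 teacher and a ResNet-18 student; the head depth, layer widths, optimizer, and MSE loss are illustrative choices, not the exact configuration from the paper.

```python
# Hedged sketch: feature-regression distillation with an MLP head that is
# used only during distillation and discarded for the downstream task.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Frozen self-supervised teacher (weights here are random placeholders).
teacher = torchvision.models.resnet50(weights=None)
teacher.fc = nn.Identity()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student backbone plus a deeper non-linear projection head (illustrative sizes).
student = torchvision.models.resnet18(weights=None)
student.fc = nn.Identity()
head = nn.Sequential(
    nn.Linear(512, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048),  # project to the teacher's feature dimension
)

opt = torch.optim.SGD(list(student.parameters()) + list(head.parameters()),
                      lr=0.05, momentum=0.9, weight_decay=1e-4)

def distill_step(images):
    """One step: the SAME weakly augmented batch is fed to both networks."""
    with torch.no_grad():
        t_feat = F.normalize(teacher(images), dim=1)
    s_feat = F.normalize(head(student(images)), dim=1)
    loss = F.mse_loss(s_feat, t_feat)  # plain feature regression
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for weakly augmented ImageNet images.
distill_step(torch.randn(8, 3, 224, 224))
# After distillation, keep `student` alone for downstream tasks and drop `head`.
```

For distilling several teachers at once, the same pattern would simply use one independent head per teacher, each regressing onto that teacher's features.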
Related papers
- Understanding the Gains from Repeated Self-Distillation [65.53673000292079]
Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
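As a rough illustration of the multi-step idea on a regression task, here is a hedged sketch using ridge regression on synthetic data; the model class, regularization strength, and data are assumptions for illustration, not the paper's experimental setup.

```python
# Hedged sketch: multi-step self-distillation for regression.
# Step 0 trains on the labels; each later step refits on the previous model's predictions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
X = rng.normal(size=(200, 10))
y = X @ w_true + 0.5 * rng.normal(size=200)         # noisy training targets
X_test = rng.normal(size=(1000, 10))
y_test = X_test @ w_true                            # noiseless test targets

def self_distill(X, y, steps, alpha=1.0):
    """Refit the same model class on its own predictions `steps` times."""
    targets = y
    for _ in range(steps + 1):
        model = Ridge(alpha=alpha).fit(X, targets)
        targets = model.predict(X)                  # the next student regresses onto these
    return model

for steps in (0, 1, 3):
    mse = mean_squared_error(y_test, self_distill(X, y, steps).predict(X_test))
    print(f"{steps} self-distillation step(s): test MSE = {mse:.4f}")
```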
arXiv Detail & Related papers (2024-07-05T15:48:34Z)
- Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
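A minimal sketch of the instance-relation idea, assuming a dense cosine-similarity graph over each batch and a simple MSE alignment loss; the paper's actual graph construction and objective may differ.

```python
# Hedged sketch: align the instance-instance relation graphs of teacher and
# student features within a batch.
import torch
import torch.nn.functional as F

def relation_graph(features):
    """Pairwise cosine-similarity matrix over the batch (a dense instance graph)."""
    z = F.normalize(features, dim=1)
    return z @ z.t()

def graph_alignment_loss(student_feats, teacher_feats):
    """Penalize disagreement between the student and (frozen) teacher graphs."""
    return F.mse_loss(relation_graph(student_feats),
                      relation_graph(teacher_feats).detach())

# Toy usage: 16 images, student features of dim 512, teacher features of dim 2048.
loss = graph_alignment_loss(torch.randn(16, 512), torch.randn(16, 2048))
```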
arXiv Detail & Related papers (2022-11-23T19:27:48Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models but struggles with small models.
We introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Visualizing the embedding space to explain the effect of knowledge distillation [5.678337324555035]
Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does 'better'.
arXiv Detail & Related papers (2021-10-09T07:04:26Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
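For intuition, a hedged sketch of predictive churn between two classifiers, alongside a standard distillation loss that uses the base model as the teacher; the temperature, mixing weight, and KL form are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: predictive churn, and distillation toward a frozen base model.
import torch
import torch.nn.functional as F

def churn(logits_a, logits_b):
    """Fraction of examples on which the two models' predicted classes differ."""
    return (logits_a.argmax(dim=1) != logits_b.argmax(dim=1)).float().mean()

def distill_loss(student_logits, base_logits, labels, alpha=0.5, temperature=2.0):
    """Cross-entropy on labels mixed with a temperature-scaled KL to the base model."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  F.softmax(base_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    return (1 - alpha) * ce + alpha * kl

# Toy usage on random logits for a 10-class problem.
s, b = torch.randn(32, 10), torch.randn(32, 10)
print(churn(s, b), distill_loss(s, b, torch.randint(0, 10, (32,))))
```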
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- Students are the Best Teacher: Exit-Ensemble Distillation with Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models.
Our solution performs distillation by only driving the student model's predictions to be consistent with those of the teacher model.
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
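A minimal sketch of that prediction-only distillation, under these assumptions: a ResNet-50 stands in for the big pre-trained teacher, a MobileNetV3-large for the student, and a temperature-scaled KL loss matches their predictions; the paper's exact loss and training details may differ.

```python
# Hedged sketch: distillation driven only by matching the student's predictions
# to the teacher's predictions (no ground-truth labels required).
import torch
import torch.nn.functional as F
import torchvision

teacher = torchvision.models.resnet50(weights=None).eval()   # stand-in for a big model
student = torchvision.models.mobilenet_v3_large(weights=None)

def prediction_matching_loss(images, temperature=1.0):
    with torch.no_grad():
        t_prob = F.softmax(teacher(images) / temperature, dim=1)
    s_logp = F.log_softmax(student(images) / temperature, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

loss = prediction_matching_loss(torch.randn(4, 3, 224, 224))
loss.backward()   # gradients reach only the student
```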
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
- Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z)
- RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr [60.07531696857743]
Fine-tuning a deep convolutional neural network (CNN) starting from a pre-trained model helps transfer knowledge learned from larger datasets to the target task.
We propose RIFLE - a strategy that deepens backpropagation in transfer learning settings.
RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning.
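A hedged sketch of the re-initialization idea, assuming a ResNet-18 backbone, a fresh linear head for a 10-class target task, and a tiny illustrative schedule; none of these choices are taken from the paper.

```python
# Hedged sketch of RIFLE-style fine-tuning: periodically re-initialize the final
# fully-connected layer so deeper CNN layers keep receiving strong gradients.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights=None)   # imagine ImageNet-pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)      # new head for the target task
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

num_epochs, reinit_every = 6, 2                     # toy schedule for illustration
for epoch in range(num_epochs):
    if epoch > 0 and epoch % reinit_every == 0:
        model.fc.reset_parameters()                 # the RIFLE step
    # One (toy) fine-tuning step per epoch; real training would loop over a dataloader.
    images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```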
arXiv Detail & Related papers (2020-07-07T11:27:43Z)
- Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
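A rough sketch of how mixup and active learning could be combined for blackbox distillation: synthesize a pool of mixup images and spend the query budget on the ones the student is least confident about; the Beta parameter, confidence-based selection rule, and budget are assumptions for illustration.

```python
# Hedged sketch: active mixup for data-efficient blackbox distillation.
import torch
import torch.nn.functional as F

def mixup_pool(images, n_mix, alpha=1.0):
    """Create n_mix convex combinations of randomly paired images."""
    idx_a = torch.randint(0, len(images), (n_mix,))
    idx_b = torch.randint(0, len(images), (n_mix,))
    lam = torch.distributions.Beta(alpha, alpha).sample((n_mix, 1, 1, 1))
    return lam * images[idx_a] + (1 - lam) * images[idx_b]

def select_uncertain(student, pool, budget):
    """Active step: keep the mixup images with the least confident student predictions."""
    with torch.no_grad():
        conf = F.softmax(student(pool), dim=1).max(dim=1).values
    return pool[conf.argsort()[:budget]]

# Usage outline (the blackbox teacher is only a prediction API):
#   pool    = mixup_pool(unlabeled_images, n_mix=10_000)
#   queries = select_uncertain(student, pool, budget=1_000)
#   soft_labels = blackbox_teacher(queries)   # the only teacher queries made
#   ...train the student on (queries, soft_labels), then repeat the loop.
```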
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.