Beyond Self-Supervision: A Simple Yet Effective Network Distillation
Alternative to Improve Backbones
- URL: http://arxiv.org/abs/2103.05959v1
- Date: Wed, 10 Mar 2021 09:32:44 GMT
- Title: Beyond Self-Supervision: A Simple Yet Effective Network Distillation
Alternative to Improve Backbones
- Authors: Cheng Cui and Ruoyu Guo and Yuning Du and Dongliang He and Fu Li and
Zewu Wu and Qiwen Liu and Shilei Wen and Jizhou Huang and Xiaoguang Hu and
Dianhai Yu and Errui Ding and Yanjun Ma
- Abstract summary: We propose to improve existing baseline networks via knowledge distillation from off-the-shelf, powerful pre-trained models.
Our solution performs distillation by only driving the student model's predictions to be consistent with those of the teacher model.
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy on the ImageNet-1k validation set of MobileNetV3-large and ResNet50-D can be significantly improved.
- Score: 40.33419553042038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, research efforts have been concentrated on revealing how
pre-trained models make a difference in neural network performance.
Self-supervision and semi-supervised learning technologies have been
extensively explored by the community and have proven to be of great potential
for obtaining powerful pre-trained models. However, these models require huge
training costs (i.e., hundreds of millions of images or training iterations).
In this paper, we propose to improve existing baseline networks via knowledge
distillation from off-the-shelf, powerful pre-trained models. Different from
existing knowledge distillation frameworks, which require the student model to
be consistent with both the soft labels generated by the teacher model and the
hard labels annotated by humans, our solution performs distillation by only
driving the student model's predictions to be consistent with those of the
teacher model. Therefore, our distillation setting can dispense with manually
labeled data and can be trained with extra unlabeled data to fully exploit the
capability of the teacher model for better learning. We empirically find that
such a simple distillation setting is extremely effective; for example, the
top-1 accuracy on the ImageNet-1k validation set of MobileNetV3-large and
ResNet50-D can be significantly improved from 75.2% to 79% and from 79.1% to
83%, respectively. We have also thoroughly analyzed the dominant factors that
affect distillation performance and how they make a difference. Extensive
downstream computer vision tasks, including transfer learning, object
detection, and semantic segmentation, can significantly benefit from the
distilled pretrained models. All our experiments are implemented with
PaddlePaddle; the code and a series of improved pretrained models with the
ssld suffix are available in PaddleClas.
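
As a rough illustration of the distillation setting described above, the sketch below trains a student only to match a teacher's soft predictions, with no ground-truth labels in the loss, so extra unlabeled images can be used. This is a minimal PyTorch-style sketch, not the authors' PaddlePaddle/SSLD implementation; the teacher/student pair, the KL-divergence consistency loss, and the synthetic unlabeled_loader are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): label-free distillation where the
# student's only objective is consistency with the teacher's predictions.
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative teacher/student pair; the paper's SSLD models differ.
teacher = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1).eval()
student = models.mobilenet_v3_large(weights=None)

optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Stand-in for a loader over unlabeled images (no labels are needed at all).
unlabeled_loader = [torch.randn(8, 3, 224, 224) for _ in range(2)]

for images in unlabeled_loader:
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(images), dim=1)   # soft targets only
    student_log_probs = F.log_softmax(student(images), dim=1)
    # Single objective: drive the student's predictions to be consistent with
    # the teacher's; no hard labels enter the loss.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```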
Related papers
- BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations via an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Establishing a stronger baseline for lightweight contrastive models [10.63129923292905]
Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks.
A common practice is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher.
In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model.
arXiv Detail & Related papers (2022-12-14T11:20:24Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- No One Representation to Rule Them All: Overlapping Features of Training Methods [12.58238785151714]
High-performing models tend to make similar predictions regardless of training methodology.
Recent work has made very different training techniques, such as large-scale contrastive learning, yield competitively high accuracy.
We show that these models specialize in how they generalize the data, leading to higher ensemble performance.
arXiv Detail & Related papers (2021-10-20T21:29:49Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn (see the sketch after this list).
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model for the whole distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Large-Scale Generative Data-Free Distillation [17.510996270055184]
We propose a new method to train a generative image model by leveraging the intrinsic statistics of the normalization layers.
The proposed method pushes forward the data-free distillation performance on CIFAR-10 and CIFAR-100 to 95.02% and 77.02% respectively.
We are able to scale it to ImageNet dataset, which to the best of our knowledge, has never been done using generative models in a data-free setting.
arXiv Detail & Related papers (2020-12-10T10:54:38Z)
- Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model.
Experiments on image classification with clean/noisy labels and on neural machine translation empirically demonstrate that our algorithm achieves significant improvements over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z)
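
For the "Churn Reduction via Distillation" entry above, the following sketch illustrates the general idea under stated assumptions: predictive churn is measured as the fraction of examples whose predicted class changes relative to the base model, and a distillation term toward the frozen base model acts as a soft constraint on that churn. The loss weighting, temperature, and function names are illustrative, not that paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def predictive_churn(new_logits: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """Fraction of examples whose argmax prediction differs from the base model."""
    return (new_logits.argmax(dim=1) != base_logits.argmax(dim=1)).float().mean()

def churn_aware_loss(new_logits, base_logits, labels, alpha=0.5, temperature=2.0):
    # Ordinary supervised loss on the labels.
    ce = F.cross_entropy(new_logits, labels)
    # Distillation toward the frozen base model; keeping the new model close to
    # the base model's predictions implicitly limits the predictive churn.
    kd = F.kl_div(
        F.log_softmax(new_logits / temperature, dim=1),
        F.softmax(base_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1 - alpha) * ce + alpha * kd
```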