SEED: Self-supervised Distillation For Visual Representation
- URL: http://arxiv.org/abs/2101.04731v1
- Date: Tue, 12 Jan 2021 20:04:50 GMT
- Title: SEED: Self-supervised Distillation For Visual Representation
- Authors: Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang,
Zicheng Liu
- Abstract summary: We propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), which leverages a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion.
We show that SEED dramatically boosts the performance of small networks on downstream tasks.
- Score: 34.63488756535054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is concerned with self-supervised learning for small models. The
problem is motivated by our empirical studies that while the widely used
contrastive self-supervised learning method has shown great progress on large
model training, it does not work well for small models. To address this
problem, we propose a new learning paradigm, named SElf-SupErvised Distillation
(SEED), where we leverage a larger network (as Teacher) to transfer its
representational knowledge into a smaller architecture (as Student) in a
self-supervised fashion. Instead of directly learning from unlabeled data, we
train a student encoder to mimic the similarity score distribution inferred by
a teacher over a set of instances. We show that SEED dramatically boosts the
performance of small networks on downstream tasks. Compared with
self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6%
on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the
ImageNet-1k dataset.
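The mechanism described in the abstract lends itself to a short sketch. Below is a minimal PyTorch-style illustration of a SEED-style distillation loss, assuming L2-normalized embeddings and a maintained queue of teacher embeddings; the function name, temperature values, and queue handling are illustrative and not taken from the paper's code.
```python
import torch.nn.functional as F

def seed_distillation_loss(student_emb, teacher_emb, queue, t_student=0.2, t_teacher=0.07):
    """Cross-entropy between the teacher's and the student's similarity
    distributions, computed over a shared queue of instance embeddings.

    student_emb: (B, D) student features for the current batch
    teacher_emb: (B, D) teacher features for the same images
    queue:       (K, D) memory queue of previously computed teacher embeddings
    """
    # L2-normalize so dot products become cosine similarities
    student_emb = F.normalize(student_emb, dim=1)
    teacher_emb = F.normalize(teacher_emb, dim=1)
    queue = F.normalize(queue, dim=1)

    # Similarity of each image to every instance in the queue: (B, K)
    logits_student = student_emb @ queue.t() / t_student
    logits_teacher = teacher_emb @ queue.t() / t_teacher

    # The student is trained to match the teacher's soft similarity distribution
    p_teacher = F.softmax(logits_teacher, dim=1)
    log_p_student = F.log_softmax(logits_student, dim=1)
    return -(p_teacher * log_p_student).sum(dim=1).mean()
```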
Related papers
- Retro: Reusing teacher projection head for efficient embedding distillation on Lightweight Models via Self-supervised Learning [0.0]
We propose Retro, which reuses the teacher's projection head for students.
Our experimental results demonstrate significant improvements over the state-of-the-art on all lightweight models.
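The reuse idea can be sketched roughly as follows, assuming the teacher's projection head is kept frozen and its input dimension matches (or is adapted to) the student backbone's output; this is an illustrative PyTorch wrapper, not the paper's implementation.
```python
import torch.nn as nn

class StudentWithTeacherHead(nn.Module):
    """Illustrative wrapper: a lightweight student backbone whose features are
    passed through a projection head copied from the pretrained teacher."""

    def __init__(self, student_backbone: nn.Module, teacher_head: nn.Module, freeze_head: bool = True):
        super().__init__()
        self.backbone = student_backbone
        self.head = teacher_head
        if freeze_head:
            # Assumption for this sketch: keep the reused head fixed so only
            # the student backbone is updated during distillation.
            for p in self.head.parameters():
                p.requires_grad = False

    def forward(self, x):
        return self.head(self.backbone(x))
```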
arXiv Detail & Related papers (2024-05-24T07:53:09Z) - Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improvement ability of large language models has been shown to be absent from, and difficult to learn for, smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve LLaMA-7b's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z) - Dual Learning for Large Vocabulary On-Device ASR [64.10124092250128]
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once.
We provide an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
arXiv Detail & Related papers (2023-01-11T06:32:28Z) - Establishing a stronger baseline for lightweight contrastive models [10.63129923292905]
Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks.
A common practice is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher.
In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model.
arXiv Detail & Related papers (2022-12-14T11:20:24Z) - Distilling Knowledge from Self-Supervised Teacher by Embedding Graph
Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
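A rough illustration of this instance-relation idea (not the paper's exact objective): build a similarity graph over a batch in both the teacher's and the student's embedding spaces and penalize the discrepancy between corresponding edges.
```python
import torch.nn.functional as F

def relation_graph_loss(student_emb, teacher_emb):
    """Align the instance-instance similarity graphs of student and teacher
    over one batch of images.

    student_emb: (B, D_s) student features
    teacher_emb: (B, D_t) teacher features for the same images
    (feature dimensions may differ; only pairwise similarities are compared)
    """
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb, dim=1)
    # Edge weights of the two graphs: cosine similarity for every pair, shape (B, B)
    graph_student = s @ s.t()
    graph_teacher = t @ t.t()
    # Penalize mismatch between corresponding edges
    return F.mse_loss(graph_student, graph_teacher)
```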
arXiv Detail & Related papers (2022-11-23T19:27:48Z) - Distill on the Go: Online knowledge distillation in self-supervised
learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
arXiv Detail & Related papers (2021-04-20T09:59:23Z) - DisCo: Remedy Self-supervised Learning on Lightweight Models with
Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance suffers a cliff-like drop when the model size decreases.
We propose a simple yet effective method, Distilled Contrastive Learning (DisCo), to alleviate this issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z) - Beyond Self-Supervision: A Simple Yet Effective Network Distillation
Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf, powerful pre-trained models.
Our solution performs distillation only by driving the student model's predictions to be consistent with those of the teacher model.
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
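This prediction-consistency objective amounts to vanilla logit-level distillation; a minimal sketch is below, where the temperature value and the T^2 scaling are common conventions rather than details taken from the paper.
```python
import torch.nn.functional as F

def prediction_consistency_loss(student_logits, teacher_logits, temperature=4.0):
    """Logit-level distillation: push the student's softened class distribution
    toward the teacher's; no ground-truth labels are used.

    student_logits, teacher_logits: (B, C) classifier outputs for the same images.
    """
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    # KL divergence, scaled by T^2 as is conventional in distillation
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```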
arXiv Detail & Related papers (2021-03-10T09:32:44Z) - CompRess: Self-Supervised Learning by Compressing Representations [14.739041141948032]
We develop a model compression method to compress an already learned, deep self-supervised model (teacher) into a smaller one (student).
We train the student model so that it mimics the relative similarity between the data points in the teacher's embedding space.
This is the first time a self-supervised AlexNet has outperformed a supervised one on ImageNet classification.
arXiv Detail & Related papers (2020-10-28T02:49:18Z) - Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z) - What Makes for Good Views for Contrastive Learning? [90.49736973404046]
We argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact.
We devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI.
As a by-product, we achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification.
arXiv Detail & Related papers (2020-05-20T17:59:57Z)
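For reference, the view-selection principle summarized in the last entry is commonly stated as an information-theoretic objective of roughly the following form (the notation is a paraphrase, not quoted from the paper): choose views that share as little information as possible while each retains everything relevant to the downstream task,
```latex
\min_{v_1, v_2} \; I(v_1; v_2)
\quad \text{subject to} \quad
I(v_1; y) = I(v_2; y) = I(x; y),
```
where x is the input, v_1 and v_2 are the two augmented views, and y is the downstream label.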