Attention Distillation: self-supervised vision transformer students need
more guidance
- URL: http://arxiv.org/abs/2210.00944v1
- Date: Mon, 3 Oct 2022 14:01:46 GMT
- Title: Attention Distillation: self-supervised vision transformer students need
more guidance
- Authors: Kai Wang, Fei Yang, Joost van de Weijer
- Abstract summary: We study knowledge distillation of self-supervised vision transformers (ViT-SSKD).
In experiments on ImageNet-Subset and ImageNet-1K, we show that our method AttnDistill outperforms existing self-supervised knowledge distillation (SSKD) methods.
We are also the first to apply the tiny ViT-T model to self-supervised learning.
- Score: 37.37272042838411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning has been widely applied to train high-quality vision
transformers. Unleashing their excellent performance on memory- and
compute-constrained devices is therefore an important research topic. However,
how to distill knowledge from one self-supervised ViT to another has not yet
been explored. Moreover, existing self-supervised knowledge distillation (SSKD)
methods, which focus on ConvNet-based architectures, are suboptimal for ViT
knowledge distillation. In this paper, we study knowledge distillation of
self-supervised vision transformers (ViT-SSKD). We show that directly
distilling information from the teacher's crucial attention mechanism to the
student can significantly narrow the performance gap between the two. In
experiments on ImageNet-Subset and ImageNet-1K, we show that our method
AttnDistill outperforms existing self-supervised knowledge distillation (SSKD)
methods and achieves state-of-the-art k-NN accuracy compared with
self-supervised learning (SSL) methods trained from scratch (with the ViT-S
model). We are also the first to apply the tiny ViT-T model to self-supervised
learning. Moreover, AttnDistill is independent of the self-supervised learning
algorithm; it can be adapted to ViT-based SSL methods to improve their
performance in future research. The code is available at:
https://github.com/wangkai930418/attndistill
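
To make the central idea concrete, the following is a minimal PyTorch sketch of attention-map distillation in general, not the authors' AttnDistill implementation: averaging over heads and the KL objective on the [CLS]-token attention are simplifying assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) between [CLS]-token attention distributions.

    Both inputs are (batch, heads, num_patches) attention weights of the [CLS]
    token over patch tokens from the last transformer block. Averaging over
    heads is one simple way to align a student and a teacher with different
    head counts (an assumption for this sketch, not necessarily AttnDistill's).
    """
    s = student_attn.mean(dim=1).clamp_min(eps)   # (batch, num_patches)
    t = teacher_attn.mean(dim=1).clamp_min(eps)
    s = s / s.sum(dim=-1, keepdim=True)           # renormalize after clamping
    t = t / t.sum(dim=-1, keepdim=True)
    return F.kl_div(s.log(), t, reduction="batchmean")

# Toy usage: a ViT-S-like student (6 heads), a ViT-B-like teacher (12 heads), 196 patches.
student_maps = torch.rand(4, 6, 196).softmax(dim=-1)
teacher_maps = torch.rand(4, 12, 196).softmax(dim=-1)
loss = attention_distillation_loss(student_maps, teacher_maps)
```

In a complete setup this would be one term of the training objective; the resulting student features would then be evaluated with the k-NN protocol mentioned in the abstract.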
Related papers
- Supervised Masked Knowledge Distillation for Few-Shot Transformers [36.46755346410219]
We propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers.
Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens.
Despite its simple design, our method outperforms previous methods by a large margin and achieves a new state-of-the-art.
arXiv Detail & Related papers (2023-03-25T03:31:46Z)
- Teaching Matters: Investigating the Role of Supervision in Vision Transformers [32.79398665600664]
We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
arXiv Detail & Related papers (2022-12-07T18:59:45Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
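
As a rough illustration of the hybrid backbone just described (placeholder sizes, not the SSMTL++ configuration), multi-head self-attention can be applied on top of 3D convolutional features by flattening the spatio-temporal grid into a token sequence:

```python
import torch
import torch.nn as nn

class Conv3dWithSelfAttention(nn.Module):
    """A 3D convolutional stem followed by multi-head self-attention over its features.

    Illustrative only: channel counts, head count, and depth are arbitrary
    placeholder choices, not the configuration used in SSMTL++.
    """
    def __init__(self, in_channels=3, dim=64, num_heads=4):
        super().__init__()
        self.stem = nn.Conv3d(in_channels, dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video):                      # video: (B, C, T, H, W)
        feats = self.stem(video)                   # (B, dim, T', H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens) # attention over all spatio-temporal positions
        return out

clip = torch.randn(2, 3, 8, 32, 32)                # a tiny toy video batch
print(Conv3dWithSelfAttention()(clip).shape)       # torch.Size([2, 1024, 64])
```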
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
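
A generic sketch of what feature self-relation can look like for ViT tokens, kept deliberately simple rather than reproducing SERE's exact formulation: spatial self-relation compares tokens with each other, channel self-relation compares feature channels.

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(tokens, temperature=0.1):
    """Token-to-token similarity distributions, shape (B, N, N).

    tokens: (B, N, D) patch-token features from a ViT block.
    """
    z = F.normalize(tokens, dim=-1)
    return F.softmax(z @ z.transpose(1, 2) / temperature, dim=-1)

def channel_self_relation(tokens, temperature=0.1):
    """Channel-to-channel similarity distributions, shape (B, D, D)."""
    z = F.normalize(tokens.transpose(1, 2), dim=-1)  # (B, D, N), normalized over tokens
    return F.softmax(z @ z.transpose(1, 2) / temperature, dim=-1)

# The self-relations of two augmented views could then be aligned with a KL or
# cross-entropy term as the self-supervised objective (one possible choice).
tokens = torch.randn(2, 196, 384)
print(spatial_self_relation(tokens).shape, channel_self_relation(tokens).shape)
```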
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
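
A condensed sketch of the self-distillation objective DINO popularized; the projection-head size is a placeholder, and the EMA teacher update and multi-crop augmentation are omitted here:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and the student.

    student_out, teacher_out: (batch, K) projection-head outputs for two views of
    the same images; center: running mean of teacher outputs, used to avoid collapse.
    """
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# Toy example with a K=1024 projection head; in DINO the teacher is an EMA copy
# of the student and the center is maintained as a running average.
student_out = torch.randn(8, 1024)
teacher_out = torch.randn(8, 1024)
center = teacher_out.mean(dim=0, keepdim=True)
loss = dino_loss(student_out, teacher_out, center)
```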
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016]
Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
Our results show significant performance gain in the presence of noisy and limited labels.
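
As a generic illustration of single-stage online distillation between two jointly trained self-supervised peers (not DoGo's exact objective), each network can be trained to match the other's softened in-batch similarity distribution:

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(z_a, z_b, temperature=0.1):
    """Symmetric KL between two peers' in-batch similarity distributions.

    z_a, z_b: (batch, dim) embeddings of the same images from two peer networks
    trained jointly; each peer distills from the other, treated as a fixed target.
    """
    a = F.normalize(z_a, dim=-1)
    b = F.normalize(z_b, dim=-1)
    mask = torch.eye(a.size(0), dtype=torch.bool, device=a.device)
    sim_a = (a @ a.t() / temperature).masked_fill(mask, float("-inf"))  # drop self-similarity
    sim_b = (b @ b.t() / temperature).masked_fill(mask, float("-inf"))
    p_a = F.log_softmax(sim_a, dim=-1)
    p_b = F.log_softmax(sim_b, dim=-1)
    kl_ab = F.kl_div(p_a, p_b.detach(), reduction="batchmean", log_target=True)
    kl_ba = F.kl_div(p_b, p_a.detach(), reduction="batchmean", log_target=True)
    return 0.5 * (kl_ab + kl_ba)

# Each peer would also optimize its own self-supervised loss; this term only
# couples the two models during training.
loss = mutual_distillation_loss(torch.randn(16, 128), torch.randn(16, 128))
```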
arXiv Detail & Related papers (2021-04-20T09:59:23Z)
- SiT: Self-supervised vIsion Transformer [23.265568744478333]
In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets.
arXiv Detail & Related papers (2021-04-08T08:34:04Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
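
A rough sketch of the recipe described in this last entry, with the contrastive setup and the KL objective chosen for illustration rather than taken from the paper: the student matches the teacher's similarity structure over augmented views as an auxiliary task.

```python
import torch
import torch.nn.functional as F

def self_supervision_transfer_loss(student_feats, teacher_feats, temperature=0.1):
    """Match the student's similarity distribution over augmented views to the teacher's.

    student_feats, teacher_feats: (2 * batch, dim) features of the same images under
    two augmentations, stacked as [view1; view2], from student and teacher networks
    (the feature dimensions may differ, since only similarity matrices are compared).
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
    sim_s = (s @ s.t() / temperature).masked_fill(mask, float("-inf"))
    sim_t = (t @ t.t() / temperature).masked_fill(mask, float("-inf"))
    log_p_s = F.log_softmax(sim_s, dim=-1)
    p_t = F.softmax(sim_t, dim=-1).detach()        # the teacher is a fixed target
    return F.kl_div(log_p_s, p_t, reduction="batchmean")

# Used as an auxiliary term next to a conventional logit- or feature-level
# distillation loss.
loss = self_supervision_transfer_loss(torch.randn(8, 128), torch.randn(8, 256))
```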
This list is automatically generated from the titles and abstracts of the papers on this site.