Emerging Properties in Self-Supervised Vision Transformers
- URL: http://arxiv.org/abs/2104.14294v1
- Date: Thu, 29 Apr 2021 12:28:51 GMT
- Title: Emerging Properties in Self-Supervised Vision Transformers
- Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
- Abstract summary: We show that self-supervised learning provides Vision Transformers (ViTs) with properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we question if self-supervised learning provides new
properties to Vision Transformer (ViT) that stand out compared to convolutional
networks (convnets). Beyond the fact that adapting self-supervised methods to
this architecture works particularly well, we make the following observations:
first, self-supervised ViT features contain explicit information about the
semantic segmentation of an image, which does not emerge as clearly with
supervised ViTs, nor with convnets. Second, these features are also excellent
k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study
also underlines the importance of momentum encoder, multi-crop training, and
the use of small patches with ViTs. We implement our findings into a simple
self-supervised method, called DINO, which we interpret as a form of
self-distillation with no labels. We show the synergy between DINO and ViTs by
achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
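The abstract describes DINO as self-distillation with no labels, built on a momentum encoder and multi-crop training. Below is a minimal, hedged sketch of that idea in PyTorch: a student network is trained to match the centered, sharpened outputs of an EMA teacher across two augmented views. The tiny MLP backbone, noise-based "crops", and hyper-parameters are illustrative assumptions, not the paper's actual recipe.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Student and momentum (EMA) teacher share the same architecture; only the
# student receives gradients. A toy MLP stands in for the ViT backbone + head.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU(), nn.Linear(256, 64))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
center = torch.zeros(64)                      # running center of teacher outputs
t_student, t_teacher, momentum = 0.1, 0.04, 0.996   # illustrative values

def dino_loss(student_out, teacher_out):
    # Teacher output is centered and sharpened; the student is trained to match it.
    t = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    s = F.log_softmax(student_out / t_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()

for step in range(10):                        # toy loop on random images
    x = torch.randn(8, 3, 32, 32)
    crop_a = x + 0.1 * torch.randn_like(x)    # noisy stand-ins for two global crops
    crop_b = x + 0.1 * torch.randn_like(x)
    loss = 0.5 * (dino_loss(student(crop_a), teacher(crop_b))
                  + dino_loss(student(crop_b), teacher(crop_a)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                     # EMA update of the teacher and the center
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
        center = 0.9 * center + 0.1 * teacher(crop_a).mean(dim=0)
```

In the paper itself, the backbone is a ViT trained with multi-crop augmentation (additional small local crops pass only through the student), and the frozen features are then evaluated with k-NN and linear probes.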
Related papers
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have recently gained much attention as an architectural choice that often outperforms convolutional networks on various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Position Labels for Self-Supervised Vision Transformer [1.3406858660972554]
Position encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image.
We propose two position labels dedicated to 2D images including absolute position and relative position.
Our position labels can be easily plugged into the transformer and combined with various current ViT variants.
arXiv Detail & Related papers (2022-06-10T10:29:20Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.