An Empirical Study of Training Self-Supervised Visual Transformers
- URL: http://arxiv.org/abs/2104.02057v1
- Date: Mon, 5 Apr 2021 17:59:40 GMT
- Title: An Empirical Study of Training Self-Supervised Visual Transformers
- Authors: Xinlei Chen and Saining Xie and Kaiming He
- Abstract summary: We study the effects of several fundamental components for training self-supervised Visual Transformers.
We reveal that these results are indeed partial failures, and they can be improved when training is made more stable.
- Score: 70.27107708555185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper does not describe a novel method. Instead, it studies a
straightforward, incremental, yet must-know baseline given the recent progress
in computer vision: self-supervised learning for Visual Transformers (ViT).
While the training recipes for standard convolutional networks have been highly
mature and robust, the recipes for ViT are yet to be built, especially in the
self-supervised scenarios where training becomes more challenging. In this
work, we go back to basics and investigate the effects of several fundamental
components for training self-supervised ViT. We observe that instability is a
major issue that degrades accuracy, and it can be hidden by apparently good
results. We reveal that these results are indeed partial failure, and they can
be improved when training is made more stable. We benchmark ViT results in MoCo
v3 and several other self-supervised frameworks, with ablations in various
aspects. We discuss the currently positive evidence as well as challenges and
open questions. We hope that this work will provide useful data points and
experience for future research.
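One concrete way the instability can be tamed, as reported in the paper, is to keep the patch projection layer frozen at its random initialization. Below is a minimal PyTorch sketch of that idea, not the authors' code; the timm model name and the `patch_embed` attribute are assumptions about the ViT implementation.

```python
# Minimal sketch (not the authors' code): freeze the random patch projection
# to stabilize self-supervised ViT training. The timm model name and the
# `patch_embed` attribute are assumptions about the ViT implementation.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", num_classes=0)

# Keep the patch projection at its random initialization; do not update it.
for p in model.patch_embed.parameters():
    p.requires_grad = False

# Optimize only the remaining parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1.5e-4, weight_decay=0.1)
```

The rest of training (for example, the MoCo v3 contrastive objective) proceeds unchanged on the remaining parameters.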
Related papers
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
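A minimal sketch of the experts-averaging step described above, assuming each expert is an FFN module with identical architecture; the helper name is hypothetical and the code is illustrative rather than the paper's implementation.

```python
# Hedged sketch: merge a list of identically shaped expert FFNs into a
# single FFN by averaging their parameters element-wise.
import copy
import torch

@torch.no_grad()
def average_experts(experts):
    merged = copy.deepcopy(experts[0])
    for name, param in merged.named_parameters():
        stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
        param.copy_(stacked.mean(dim=0))
    return merged
```

The merged module can then be swapped in for the MoE block so that inference runs a plain ViT.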
- Learning from Visual Observation via Offline Pretrained State-to-Go Transformer [29.548242447584194]
We propose a two-stage framework for learning from visual observation.
In the first stage, we pretrain a State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations.
In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks.
arXiv Detail & Related papers (2023-06-22T13:14:59Z)
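The intrinsic-reward idea in the second stage can be pictured schematically as follows. This is purely illustrative: the encoder/discriminator interface and the log-sigmoid reward are assumptions, not the paper's exact formulation.

```python
# Illustrative only: derive an intrinsic reward for RL from a pretrained
# transition model. `encoder` maps observations to latents and
# `discriminator` scores whether a latent transition looks expert-like;
# both names and the reward form are assumptions.
import torch

@torch.no_grad()
def intrinsic_reward(encoder, discriminator, obs, next_obs):
    z, z_next = encoder(obs), encoder(next_obs)
    score = discriminator(z, z_next)          # higher = more expert-like
    return torch.log(torch.sigmoid(score))    # reward fed to the RL agent
```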
- Strong Baselines for Parameter Efficient Few-Shot Fine-tuning [50.83426196335385]
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase.
Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC.
Fine-tuning ViTs, however, is expensive in time, compute and storage.
This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters.
arXiv Detail & Related papers (2023-04-04T16:14:39Z)
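As a concrete illustration of fine-tuning only a fraction of the parameters, the sketch below freezes a pretrained ViT and updates only the LayerNorm affine terms and the classifier head. This is one common PEFT choice rather than necessarily the recipe evaluated in the paper, and the timm model and parameter names are assumptions.

```python
# Illustrative PEFT setup: freeze the backbone, train only LayerNorm
# parameters and the classification head (one common choice; not
# necessarily the paper's exact recipe).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=5)

for name, p in model.named_parameters():
    p.requires_grad = name.startswith("head") or "norm" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2, momentum=0.9)
```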
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Teaching Matters: Investigating the Role of Supervision in Vision Transformers [32.79398665600664]
We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
arXiv Detail & Related papers (2022-12-07T18:59:45Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture that can serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
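The location-specific supervision for patch tokens can be pictured as patch-level distillation from a pretrained teacher. The sketch below is a generic illustration of that idea; the tensor shapes, the KL-based loss, and the temperature are assumptions rather than details taken from the paper.

```python
# Generic patch-level supervision sketch (assumptions throughout):
# a pretrained teacher produces per-patch targets, and each student
# patch token is guided toward them with a KL divergence.
import torch.nn.functional as F

def patch_supervision_loss(student_tokens, teacher_tokens, temperature=1.0):
    # tokens: (batch, num_patches, num_classes) logits per patch
    s = F.log_softmax(student_tokens / temperature, dim=-1)
    t = F.softmax(teacher_tokens / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```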
- Scaled ReLU Matters for Training Vision Transformers [45.41439457701873]
Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs).
However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as the learning rate and warmup.
We verify, both theoretically and empirically, that a scaled ReLU in the conv-stem not only improves training stabilization, but also increases the diversity of patch tokens.
arXiv Detail & Related papers (2021-09-08T17:57:58Z)
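For context, a conv-stem replaces the single patchify convolution with a short stack of strided convolutions. The sketch below is a generic conv + BatchNorm + ReLU stem with an overall stride of 16; it does not reproduce the paper's specific scaled-ReLU formulation.

```python
# Generic conv-stem sketch (does not reproduce the paper's scaled ReLU):
# a few stride-2 conv/BN/ReLU stages followed by a final projection,
# giving an overall stride of 16 like standard 16x16 patch embedding.
import torch.nn as nn

def conv_stem(in_ch=3, dims=(64, 128, 256), embed_dim=768):
    layers, c = [], in_ch
    for d in dims:
        layers += [nn.Conv2d(c, d, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(d),
                   nn.ReLU(inplace=True)]
        c = d
    layers.append(nn.Conv2d(c, embed_dim, kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```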
- SiT: Self-supervised vIsion Transformer [23.265568744478333]
In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets.
arXiv Detail & Related papers (2021-04-08T08:34:04Z)
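To make the idea of combining several self-supervised training mechanisms concrete, the sketch below sums two common pretext losses, image reconstruction and rotation prediction, into one objective; the particular task mix and weights are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative multi-pretext objective for self-supervised ViT training:
# reconstruction of corrupted inputs plus rotation prediction.
# The particular tasks and weights are assumptions for illustration.
import torch.nn.functional as F

def pretext_loss(recon, target_img, rot_logits, rot_labels,
                 w_recon=1.0, w_rot=1.0):
    loss_recon = F.l1_loss(recon, target_img)           # reconstruct the image
    loss_rot = F.cross_entropy(rot_logits, rot_labels)  # predict the rotation
    return w_recon * loss_recon + w_rot * loss_rot
```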
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.