DeiT III: Revenge of the ViT
- URL: http://arxiv.org/abs/2204.07118v1
- Date: Thu, 14 Apr 2022 17:13:44 GMT
- Title: DeiT III: Revenge of the ViT
- Authors: Hugo Touvron, Matthieu Cord, Hervé Jégou
- Abstract summary: A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
- Score: 56.46810490275699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A Vision Transformer (ViT) is a simple neural architecture amenable to serving
several computer vision tasks. It has limited built-in architectural priors, in
contrast to more recent architectures that incorporate priors either about the
input data or about specific tasks. Recent works show that ViTs benefit from
self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
In this paper, we revisit the supervised training of ViTs. Our procedure builds
upon and simplifies a recipe introduced for training ResNet-50. It includes a
new simple data-augmentation procedure with only 3 augmentations, closer to the
practice in self-supervised learning. Our evaluations on image classification
(ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning,
and semantic segmentation show that our procedure outperforms previous fully
supervised training recipes for ViT by a large margin. It also reveals that the
performance of our ViT trained with supervision is comparable to that of more
recent architectures. Our results could serve as better baselines for recent
self-supervised approaches demonstrated on ViT.
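For concreteness, the three augmentations in the paper's 3-Augment recipe are grayscale, solarization, and Gaussian blur, applied on top of a simple crop and horizontal flip together with color jitter. Below is a minimal torchvision sketch of such a pipeline; the probabilities and parameter values are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of a 3-Augment-style pipeline (illustrative settings,
# not the paper's exact hyper-parameters).
import torchvision.transforms as T

three_augment = T.Compose([
    T.RandomResizedCrop(224),              # standard random crop
    T.RandomHorizontalFlip(),              # standard flip
    T.RandomChoice([                       # pick exactly one of the 3 augmentations
        T.RandomGrayscale(p=1.0),
        T.RandomSolarize(threshold=128, p=1.0),
        T.GaussianBlur(kernel_size=23),
    ]),
    T.ColorJitter(0.3, 0.3, 0.3),          # simple color jitter
    T.ToTensor(),
])
```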
Related papers
- Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that improves performance without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE back into an FFN by averaging its experts, turning the model back into the original ViT for inference.
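To make the convert-back step concrete, here is a rough sketch that collapses a list of identically shaped expert FFNs into a single FFN by uniformly averaging their parameters; the helper name and module structure are assumptions for illustration, not code from the paper.

```python
import copy
import torch

@torch.no_grad()
def average_experts_into_ffn(experts):
    """Hypothetical helper: average identically shaped expert FFNs
    (a list of nn.Modules) into one FFN usable at inference time."""
    ffn = copy.deepcopy(experts[0])
    for name, param in ffn.named_parameters():
        stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
        param.copy_(stacked.mean(dim=0))   # uniform average over experts
    return ffn
```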
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms, e.g., GreenMIM, must be carefully designed for hierarchical ViTs, rather than reusing the vanilla, simple MAE that works for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - When Adversarial Training Meets Vision Transformers: Recipes from
Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
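The entry above reports findings rather than a full recipe; for context, a generic PGD-based adversarial training step (the standard formulation with illustrative settings, not this paper's specific recipe) looks roughly as follows.

```python
import torch
import torch.nn.functional as F

def pgd_adversarial_step(model, x, y, optimizer, eps=8/255, alpha=2/255, steps=10):
    """One generic PGD adversarial-training step (standard recipe, illustrative settings)."""
    # Craft adversarial examples with projected gradient descent.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project to eps-ball
    # Update the model on the adversarial examples.
    model.train()
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()
    return x_adv
```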
arXiv Detail & Related papers (2022-10-14T05:37:20Z) - A Closer Look at Self-Supervised Lightweight Vision Transformers [44.44888945683147]
Self-supervised learning on large-scale Vision Transformers (ViTs) as a pre-training method has achieved promising downstream performance.
We benchmark several self-supervised pre-training methods on image classification tasks and some downstream dense prediction tasks.
Even vanilla lightweight ViTs show comparable performance to previous SOTA networks with delicate architecture design.
arXiv Detail & Related papers (2022-05-28T14:14:57Z) - Evaluating Vision Transformer Methods for Deep Reinforcement Learning
from Pixels [7.426118390008397]
We evaluate Vision Transformer (ViT) training methods for image-based reinforcement learning control tasks.
We compare these results to a leading convolutional-network architecture method, RAD.
We find that the CNN architectures trained using RAD still generally provide superior performance.
arXiv Detail & Related papers (2022-04-11T07:10:58Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViTs) with properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
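As a rough sketch of the self-distillation idea described above (heavily simplified: it omits multi-crop, the projection heads, and the momentum/temperature schedules), one training step could look like this.

```python
import torch
import torch.nn.functional as F

def dino_style_step(student, teacher, x1, x2, center, optimizer,
                    t_s=0.1, t_t=0.04, momentum=0.996):
    """One simplified DINO-style self-distillation step on two augmented
    views x1, x2. The teacher is an EMA of the student; no labels are used."""
    with torch.no_grad():
        t_out = [teacher(v) for v in (x1, x2)]        # teacher outputs, no gradients
    s_out = [student(v) for v in (x1, x2)]            # student outputs
    loss = 0.0
    for i, t in enumerate(t_out):
        # Center and sharpen the teacher distribution to avoid collapse.
        t_prob = F.softmax((t - center) / t_t, dim=-1)
        for j, s in enumerate(s_out):
            if i != j:                                # only cross-view terms
                loss = loss - (t_prob * F.log_softmax(s / t_s, dim=-1)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # EMA update of the teacher weights and the running center.
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
        center.mul_(0.9).add_(torch.cat(t_out).mean(dim=0), alpha=0.1)
    return loss.item()
```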
arXiv Detail & Related papers (2021-04-29T12:28:51Z) - An Empirical Study of Training Self-Supervised Visual Transformers [70.27107708555185]
We study the effects of several fundamental components for training self-supervised Visual Transformers.
We reveal that these results are in fact partial failures, and they can be improved when training is made more stable.
arXiv Detail & Related papers (2021-04-05T17:59:40Z)