How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers
- URL: http://arxiv.org/abs/2106.10270v1
- Date: Fri, 18 Jun 2021 17:58:20 GMT
- Title: How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers
- Authors: Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman,
Jakob Uszkoreit, Lucas Beyer
- Abstract summary: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
- Score: 74.06040005144382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViT) have been shown to attain highly competitive
performance for a wide range of vision applications, such as image
classification, object detection and semantic image segmentation. In comparison
to convolutional neural networks, the Vision Transformer's weaker inductive
bias is generally found to cause an increased reliance on model regularization
or data augmentation (``AugReg'' for short) when training on smaller training
datasets. We conduct a systematic empirical study in order to better understand
the interplay between the amount of training data, AugReg, model size and
compute budget. As one result of this study we find that the combination of
increased compute and AugReg can yield models with the same performance as
models trained on an order of magnitude more training data: we train ViT models
of various sizes on the public ImageNet-21k dataset which either match or
outperform their counterparts trained on the larger, but not publicly available
JFT-300M dataset.
Related papers
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows to train these models without large-scale pre-training, changes to model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z) - Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs)
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimize all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z) - ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive field.
On Cifar100, our ViT-P Base model achieves the state-of-the-art accuracy (83.16%) trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) is emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses that our method is superior to other methods in that it can capture both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and existing VisionNets signal efforts on replacing hand-wired features or inductive throughputs with general-purpose neural architectures.
This paper investigates ViTs and Res-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness attributes to sparser active neurons in the first few layers.
The resultant ViTs outperform Nets of similar size and smoothness when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the Vision-friendly Transformer'
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.