Better plain ViT baselines for ImageNet-1k
- URL: http://arxiv.org/abs/2205.01580v1
- Date: Tue, 3 May 2022 15:54:44 GMT
- Title: Better plain ViT baselines for ImageNet-1k
- Authors: Lucas Beyer, Xiaohua Zhai, Alexander Kolesnikov
- Abstract summary: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data.
This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models.
- Score: 100.80574771242937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.
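The abstract does not spell out which minor modifications are meant. As a hedged illustration only, assuming (from the authors' public big_vision recipes rather than from the text above) that fixed 2D sin-cos position embeddings and MixUp augmentation are among the ingredients of such a plain ViT setup, the short NumPy sketch below shows what those two pieces can look like; the function names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def posemb_sincos_2d(h, w, dim, temperature=10000.0):
    """Fixed (non-learned) 2D sin-cos position embedding for an h x w patch grid."""
    assert dim % 4 == 0, "embedding dimension must be divisible by 4"
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    omega = np.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / (temperature ** omega)            # (dim/4,) frequency bands
    y = y.reshape(-1)[:, None] * omega[None, :]     # (h*w, dim/4)
    x = x.reshape(-1)[:, None] * omega[None, :]     # (h*w, dim/4)
    return np.concatenate([np.sin(x), np.cos(x), np.sin(y), np.cos(y)], axis=1)

def mixup(images, onehot_labels, alpha=0.2, rng=None):
    """Standard MixUp: convex combination of a batch with a shuffled copy of itself."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    return (lam * images + (1.0 - lam) * images[perm],
            lam * onehot_labels + (1.0 - lam) * onehot_labels[perm])

# Example: position table for a hypothetical ViT-S/16 at 224x224 input (14x14 patches, width 384).
print(posemb_sincos_2d(14, 14, 384).shape)  # (196, 384)
```

A fixed sin-cos table removes the learned position-embedding parameters entirely, and MixUp is one common example of the kind of standard data augmentation the abstract refers to; neither snippet claims to reproduce the paper's exact recipe.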
Related papers
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments the high-frequency components of images via adversarial training (a rough sketch of separating such components appears after this list).
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- MVP: Multimodality-guided Visual Pre-training [215.11351064601303]
Masked image modeling (MIM) has become a promising direction for visual pre-training.
In this paper, we introduce guidance from other modalities and validate that such additional knowledge leads to impressive gains for visual pre-training.
The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs.
arXiv Detail & Related papers (2022-03-10T06:11:20Z)
- Improving Vision Transformers for Incremental Learning [17.276384689286168]
This paper studies using Vision Transformers (ViT) in class incremental learning.
ViT converges very slowly when the number of classes is small.
ViT also shows more bias towards new classes than CNN-based models.
arXiv Detail & Related papers (2021-12-12T00:12:33Z)
- Pyramid Adversarial Training Improves ViT Performance [43.322865996422664]
Pyramid Adversarial Training is a simple and effective technique to improve ViT's overall performance.
It leads to a 1.82% absolute improvement in ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data.
arXiv Detail & Related papers (2021-11-30T04:38:14Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset; they either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces essentially nothing new; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while using much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
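The entry on revisiting high-frequency components above refers to the high-frequency content of images; HAT itself additionally relies on adversarial training, which is not reproduced here. As a rough sketch under that assumption, the snippet below merely separates an image into low- and high-frequency parts with a circular FFT mask (the radius threshold is an arbitrary illustrative choice, not a value from that paper).

```python
import numpy as np

def split_frequencies(img, radius=16):
    """Split a (H, W) or (H, W, C) image into low- and high-frequency parts
    using a circular mask around the centre of the shifted 2D spectrum."""
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    y, x = np.ogrid[:h, :w]
    low_mask = (y - h / 2) ** 2 + (x - w / 2) ** 2 <= radius ** 2
    if img.ndim == 3:                      # broadcast the mask over colour channels
        low_mask = low_mask[..., None]

    def invert(spec):
        return np.real(np.fft.ifft2(np.fft.ifftshift(spec, axes=(0, 1)), axes=(0, 1)))

    return invert(f * low_mask), invert(f * ~low_mask)

# Example on a random "image"; low + high recovers the original (up to float error),
# a quick sanity check that the split is lossless.
img = np.random.rand(224, 224, 3).astype(np.float32)
low, high = split_frequencies(img)
print(np.allclose(low + high, img, atol=1e-4))  # True
```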
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.