Scaling Vision Transformers to 22 Billion Parameters
- URL: http://arxiv.org/abs/2302.05442v1
- Date: Fri, 10 Feb 2023 18:58:21 GMT
- Title: Scaling Vision Transformers to 22 Billion Parameters
- Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski,
Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert
Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael
Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan
Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F.
Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn
Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina
Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić,
Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers,
Jeremiah Harmsen, Neil Houlsby
- Abstract summary: Vision Transformers (ViT) have brought the Transformer architecture to image and video modelling, but these models have not yet been scaled to nearly the same degree as language models.
We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model.
ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
- Score: 140.67853929168382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling of Transformers has driven breakthrough capabilities for language
models. At present, the largest large language models (LLMs) contain upwards of
100B parameters. Vision Transformers (ViT) have introduced the same
architecture to image and video modelling, but these have not yet been
successfully scaled to nearly the same degree; the largest dense ViT contains
4B parameters (Chen et al., 2022). We present a recipe for highly efficient and
stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of
experiments on the resulting model. When evaluated on downstream tasks (often
with a lightweight linear model on frozen features), ViT-22B demonstrates
increasing performance with scale. We further observe other interesting
benefits of scale, including an improved tradeoff between fairness and
performance, state-of-the-art alignment to human visual perception in terms of
shape/texture bias, and improved robustness. ViT-22B demonstrates the potential
for "LLM-like" scaling in vision, and provides key steps towards getting there.
Related papers
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
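For the parallel convolution-and-attention layer described in the ViTAE entries above, the Python (PyTorch) sketch below shows one plausible wiring: a depthwise convolution branch runs alongside multi-head self-attention, the two outputs are fused (here simply by addition, an assumption), and the fused features are fed into the feed-forward network. All layer sizes, the fusion rule, and the normalization placement are illustrative guesses rather than the authors' exact design.
```python
# Illustrative sketch of a transformer block with a convolution branch in
# parallel to multi-head self-attention, fused and fed into the FFN.
# Sizes, fusion-by-addition, and the depthwise 3x3 kernel are assumptions.
import torch
import torch.nn as nn

class ParallelConvAttentionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=4, grid=14):
        super().__init__()
        self.grid = grid                                   # tokens form a grid x grid feature map
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depthwise 3x3 convolution as the parallel "local" branch (assumed).
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (batch, grid*grid, dim)
        b, n, d = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                   # global attention branch
        conv_in = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)   # local conv branch
        fused = x + attn_out + conv_out                    # fuse branches with the residual
        return fused + self.ffn(self.norm2(fused))         # fused features feed the FFN

tokens = torch.randn(2, 14 * 14, 256)                      # dummy token grid
print(ParallelConvAttentionBlock()(tokens).shape)          # torch.Size([2, 196, 256])
```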
This list is automatically generated from the titles and abstracts of the papers on this site.