DeepViT: Towards Deeper Vision Transformer
- URL: http://arxiv.org/abs/2103.11886v2
- Date: Tue, 23 Mar 2021 14:45:44 GMT
- Title: DeepViT: Towards Deeper Vision Transformer
- Authors: Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian,
Qibin Hou, Jiashi Feng
- Abstract summary: Vision transformers (ViTs) have recently been successfully applied to image classification tasks.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
- Score: 92.04063170357426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have recently been successfully applied to image
classification tasks. In this paper, we show that, unlike convolutional neural
networks (CNNs), which can be improved by stacking more convolutional layers,
the performance of ViTs saturates quickly when they are scaled deeper. More
specifically, we empirically observe that this scaling difficulty is caused by
the attention collapse issue: as the transformer goes deeper, the attention
maps gradually become similar and, beyond certain layers, nearly identical. In
other words, the feature maps tend to be identical in the top layers of deep
ViT models. This demonstrates that in the deeper layers of ViTs, the
self-attention mechanism fails to learn effective concepts for representation
learning, which prevents the model from achieving the expected performance gain.
Based on the above observation, we propose a simple yet effective method, named
Re-attention, to re-generate the attention maps and increase their diversity at
different layers with negligible computation and memory cost. The proposed
method makes it feasible to train deeper ViT models with consistent performance
improvements via minor modifications to existing ViT models. Notably, when
training a deep ViT model with 32 transformer blocks, the Top-1 classification
accuracy on ImageNet can be improved by 1.6%. Code will be made publicly
available.
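
The attention collapse described above can be quantified by comparing attention maps across blocks. The snippet below is a minimal illustration of one such measurement (mean cosine similarity between the attention maps of successive blocks); it is not the paper's exact cross-layer similarity metric, and the tensor layout (batch, heads, tokens, tokens) is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_layer_attention_similarity(attn_maps):
    """Mean cosine similarity between attention maps of successive blocks.

    attn_maps: list of tensors of shape (batch, heads, tokens, tokens),
    one per transformer block. Values close to 1 in the later blocks
    indicate the attention collapse described in the abstract.
    """
    sims = []
    for a, b in zip(attn_maps[:-1], attn_maps[1:]):
        a = a.flatten(start_dim=2)                  # (B, H, N*N)
        b = b.flatten(start_dim=2)
        sims.append(F.cosine_similarity(a, b, dim=-1).mean().item())
    return sims
```

The Re-attention idea can be sketched as a standard multi-head self-attention block in which the per-head attention maps are linearly recombined across heads by a learnable H x H matrix and then normalized before being applied to the values. The PyTorch sketch below follows only that description from the abstract; the choice of normalization (BatchNorm over heads), the near-identity initialization of the mixing matrix, and the hyper-parameters are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Self-attention with cross-head mixing of attention maps (sketch)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable H x H matrix that mixes attention maps across heads
        # (initialized near the identity; an assumption, not the paper's choice).
        self.theta = nn.Parameter(
            torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads))
        # Normalization of the re-generated attention maps (assumed BatchNorm).
        self.norm = nn.BatchNorm2d(num_heads)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, H, N, C/H)

        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, N, N)
        attn = attn.softmax(dim=-1)

        # Re-attention: recombine the H attention maps across heads, then normalize.
        attn = torch.einsum('hg,bgij->bhij', self.theta, attn)
        attn = self.norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the mixing and normalization act only on H x H and (B, H, N, N) tensors, the extra computation and memory are small compared with the attention itself, which is consistent with the negligible-cost claim in the abstract.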
Related papers
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms, e.g., GreenMIM, have to be carefully designed for hierarchical ViTs instead of reusing the vanilla and simple MAE that works for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve performance on image classification tasks comparable or even superior to CNNs.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [16.308432111311195]
Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification.
We introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias.
The resulting convolutional-like ViT architecture, ConViT, outperforms DeiT on ImageNet, while offering much improved sample efficiency.
arXiv Detail & Related papers (2021-03-19T09:11:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.