ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond
- URL: http://arxiv.org/abs/2202.10108v1
- Date: Mon, 21 Feb 2022 10:40:05 GMT
- Title: ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond
- Authors: Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao
- Abstract summary: We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
- Score: 76.35955924137986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have shown great potential in various computer vision
tasks owing to their strong capability to model long-range dependency using the
self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of
visual tokens, lacking an intrinsic inductive bias (IB) in modeling local
visual structures and dealing with scale variance, which is instead learned
implicitly from large-scale training data with longer training schedules. In
this paper, we propose a Vision Transformer Advanced by Exploring intrinsic IB
from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid
reduction modules to downsample and embed the input image into tokens with rich
multi-scale context using multiple convolutions with different dilation rates.
In this way, it acquires an intrinsic scale invariance IB and can learn robust
feature representation for objects at various scales. Moreover, in each
transformer layer, ViTAE has a convolution block parallel to the multi-head
self-attention module, whose features are fused and fed into the feed-forward
network. Consequently, it has the intrinsic locality IB and is able to learn
local features and global dependencies collaboratively. The proposed two kinds
of cells are stacked in both isotropic and multi-stage manners to formulate two
families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on
the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and
AP10K datasets validate the superiority of our models over the baseline
transformer models and concurrent works. Besides, we scale up our ViTAE model
to 644M parameters and obtain state-of-the-art classification performance,
i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the
best 91.2% Top-1 accuracy on the ImageNet real validation set, without using
extra private data.
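To make the two cell types concrete, below is a minimal PyTorch sketch of a spatial pyramid reduction cell (parallel dilated convolutions that downsample and embed the image into multi-scale tokens) and a normal cell (a convolution block in parallel with multi-head self-attention, fused before the feed-forward network). Module names, channel widths, dilation rates, and the additive fusion are illustrative assumptions based only on the abstract, not the authors' reference implementation.

```python
# Minimal sketch of the two ViTAE cell types described in the abstract.
# All hyperparameters and module layouts here are illustrative assumptions.
import torch
import torch.nn as nn


class ReductionCell(nn.Module):
    """Spatial pyramid reduction: parallel dilated convolutions downsample the
    input and embed it into tokens carrying rich multi-scale context."""

    def __init__(self, in_ch, embed_dim, dilations=(1, 2, 3, 4), stride=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=stride,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(embed_dim * len(dilations), embed_dim, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        feats = [b(x) for b in self.branches]   # same spatial size per branch
        x = self.fuse(torch.cat(feats, dim=1))  # fuse multi-scale context
        B, C, H, W = x.shape
        return x.flatten(2).transpose(1, 2), (H, W)  # tokens: (B, H*W, C)


class NormalCell(nn.Module):
    """Transformer layer with a convolution block parallel to multi-head
    self-attention; both outputs are fused and fed into the feed-forward network."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Sequential(               # parallel local branch
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens, hw):               # tokens: (B, N, C)
        B, N, C = tokens.shape
        H, W = hw
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)         # global dependencies
        conv_out = self.conv(x.transpose(1, 2).reshape(B, C, H, W))
        conv_out = conv_out.flatten(2).transpose(1, 2)   # local features
        fused = tokens + attn_out + conv_out             # fuse both branches
        return fused + self.mlp(self.norm2(fused))       # feed-forward network


# Usage: embed a 224x224 image into multi-scale tokens, then apply one normal cell.
img = torch.randn(1, 3, 224, 224)
tokens, hw = ReductionCell(3, 64)(img)
out = NormalCell(64)(tokens, hw)
print(out.shape)  # torch.Size([1, 3136, 64])
```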
Related papers
- TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv).
MSA-Conv incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformers lack inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer (a rough sketch of this idea appears after the list below).
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties of ViTs via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
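As noted in the Shunted Self-Attention entry above, here is a rough sketch of what attention at hybrid scales within a single layer can look like: each attention head attends to keys and values pooled at a different spatial rate, so coarse and fine context are mixed in one layer. The pooling operator, the per-head strides, and the module layout are assumptions made only for illustration; they are not taken from the Shunted Transformer implementation.

```python
# Sketch of multi-scale key/value aggregation inside one attention layer.
# Per-head strides and average pooling are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuntedAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=4, kv_strides=(1, 2, 4, 8)):
        super().__init__()
        assert num_heads == len(kv_strides)
        self.h, self.dh = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.kv_strides = kv_strides

    def forward(self, x, hw):                    # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        H, W = hw
        q = self.q(x).view(B, N, self.h, self.dh)
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        outs = []
        for i, s in enumerate(self.kv_strides):
            # Pool tokens at stride s: coarser keys/values for this head.
            pooled = F.avg_pool2d(grid, s) if s > 1 else grid
            kv = self.kv(pooled.flatten(2).transpose(1, 2))   # (B, M, 2C)
            k, v = kv.chunk(2, dim=-1)
            k = k[..., i * self.dh:(i + 1) * self.dh]         # this head's slice
            v = v[..., i * self.dh:(i + 1) * self.dh]
            attn = (q[:, :, i] @ k.transpose(1, 2)) * self.dh ** -0.5
            outs.append(attn.softmax(dim=-1) @ v)             # (B, N, dh)
        return self.proj(torch.cat(outs, dim=-1))             # (B, N, C)


# Usage: 196 tokens from a 14x14 grid, 64-dim embeddings.
tokens = torch.randn(2, 14 * 14, 64)
print(ShuntedAttentionSketch(64)(tokens, (14, 14)).shape)  # torch.Size([2, 196, 64])
```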