ViT-P: Rethinking Data-efficient Vision Transformers from Locality
- URL: http://arxiv.org/abs/2203.02358v1
- Date: Fri, 4 Mar 2022 14:49:48 GMT
- Title: ViT-P: Rethinking Data-efficient Vision Transformers from Locality
- Authors: Bin Chen, Ran Wang, Di Ming and Xin Feng
- Abstract summary: We make vision transformers as data-efficient as convolutional neural networks by introducing a multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have a multi-scale localized receptive field.
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
- Score: 9.515925867530262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances of Transformers have brought new trust to computer vision
tasks. However, on small datasets, Transformers are hard to train and perform
worse than convolutional neural networks. We make vision transformers as
data-efficient as convolutional neural networks by introducing a multi-focal
attention bias. Inspired by the attention distance in a well-trained ViT, we
constrain the self-attention of ViT to have a multi-scale localized receptive
field. The size of the receptive field is adaptable during training so that an
optimal configuration can be learned. We provide empirical evidence that
properly constraining the receptive field can reduce the amount of training
data needed for vision transformers. On CIFAR-100, our ViT-P Base model achieves
state-of-the-art accuracy (83.16%) when trained from scratch. We also perform an
analysis on ImageNet to show that our method does not lose accuracy on large
datasets.
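To make the idea of a multi-focal attention bias concrete, here is a minimal NumPy sketch. The abstract does not give the paper's exact formulation, so everything below is an assumption for illustration: the Gaussian-shaped penalty -d^2 / (2 sigma^2), the per-head widths in `sigmas`, and all function names are hypothetical. Each head receives a distance-based penalty on its attention logits, with a different width per head so that the heads cover several receptive-field scales.

```python
import numpy as np

def patch_distances(grid):
    """Pairwise Euclidean distances between patch positions on a grid x grid layout."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))  # (N, N)

def multi_focal_bias(dist, sigmas):
    """Assumed bias form: -d^2 / (2 sigma^2), one sigma per head.

    A small sigma gives a narrow receptive field, a large sigma a wide one.
    """
    sigmas = np.asarray(sigmas, dtype=np.float64)[:, None, None]
    return -(dist[None] ** 2) / (2.0 * sigmas ** 2)  # (H, N, N)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q, k, v have shape (H, N, D); bias has shape (H, N, N).
    """
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def mean_attention_distance(weights, dist):
    """Attention-weighted average patch distance, reported per head."""
    return (weights * dist[None]).sum(-1).mean(-1)  # (H,)

# Toy example: a 4x4 patch grid, 3 heads with increasing receptive-field width.
rng = np.random.default_rng(0)
grid, heads, dim = 4, 3, 8
dist = patch_distances(grid)
sigmas = [0.5, 1.5, 4.0]  # hypothetical values; a trainable model would learn these
q, k, v = (rng.standard_normal((heads, grid * grid, dim)) for _ in range(3))
out, weights = biased_attention(q, k, v, multi_focal_bias(dist, sigmas))
print(mean_attention_distance(weights, dist))
```

In this sketch the widths are fixed constants. To mirror the abstract's statement that the receptive-field size is adaptable during training, `sigmas` would be trainable parameters in an autodiff framework.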
Related papers
- Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets [26.257612622358614]
Vision Transformers have attracted a lot of attention recently since the successful implementation of Vision Transformer (ViT) on vision tasks.
This paper proposes to explicitly increase the input information density in the frequency domain.
Experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets.
arXiv Detail & Related papers (2022-10-25T20:24:53Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple yet effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks, based on relations between image patches, that the model has to solve before or jointly with the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses that our method is superior to other methods in that it can capture both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find Vision Transformers perform poorly on a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach is essentially a combination of MoCo v2 and BYOL, with no new inventions.
The performance is slightly better than recent works of MoCo v3 and DINO which adopt DeiT as the backbone, but with much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than the CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.