How to Train Vision Transformer on Small-scale Datasets?
- URL: http://arxiv.org/abs/2210.07240v1
- Date: Thu, 13 Oct 2022 17:59:19 GMT
- Title: How to Train Vision Transformer on Small-scale Datasets?
- Authors: Hanan Gani, Muzammal Naseer and Mohammad Yaqub
- Abstract summary: In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training, changes to model architecture, or modified loss functions.
- Score: 4.56717163175988
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Transformer (ViT), a radically different architecture than
convolutional neural networks, offers multiple advantages including design
simplicity, robustness and state-of-the-art performance on many vision tasks.
However, in contrast to convolutional neural networks, Vision Transformer lacks
inherent inductive biases. Therefore, successful training of such models is
mainly attributed to pre-training on large-scale datasets such as ImageNet with
1.2M images or JFT with 300M images. This hinders the direct adaptation of
Vision Transformers to small-scale datasets. In this work, we show that
self-supervised inductive biases can be learned directly from small-scale
datasets and serve as an effective weight initialization scheme for
fine-tuning. This allows these models to be trained without large-scale
pre-training, changes to the model architecture, or modified loss functions. We present
thorough experiments to successfully train monolithic and non-monolithic Vision
Transformers on five small datasets including CIFAR10/100, CINIC10, SVHN,
Tiny-ImageNet and two fine-grained datasets: Aircraft and Cars. Our approach
consistently improves the performance of Vision Transformers while retaining
their properties such as attention to salient regions and higher robustness.
Our codes and pre-trained models are available at:
https://github.com/hananshafi/vits-for-small-scale-datasets.
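As a concrete illustration of the recipe the abstract describes (self-supervised pre-training on the small target dataset itself, used only as a weight initialization for supervised fine-tuning), here is a minimal sketch in PyTorch. The rotation-prediction pretext task, the torchvision vit_b_16 backbone, and all hyperparameters below are illustrative assumptions, not the authors' actual objective or code; their implementation is in the repository linked above.

```python
# Hedged sketch only: two-stage training in the spirit of the abstract.
# Stage 1 learns self-supervised representations on the small dataset itself;
# Stage 2 fine-tunes the same weights with labels. The rotation-prediction
# pretext task and all settings here are placeholders, not the authors' method
# (see https://github.com/hananshafi/vits-for-small-scale-datasets for that).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any ViT backbone works; torchvision's vit_b_16 is used purely as an example.
backbone = torchvision.models.vit_b_16(weights=None)
backbone.heads = nn.Identity()                 # expose pre-head CLS features
feat_dim = backbone.hidden_dim
backbone.to(device)

transform = T.Compose([T.Resize(224), T.ToTensor()])   # vit_b_16 expects 224x224
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                          transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True)

# ---- Stage 1: self-supervised pre-training on the small dataset ------------
rot_head = nn.Linear(feat_dim, 4).to(device)   # predict one of four rotations
opt = torch.optim.AdamW(list(backbone.parameters()) + list(rot_head.parameters()),
                        lr=1e-4)
for images, _ in loader:                       # labels are ignored in this stage
    images = images.to(device)
    k = torch.randint(0, 4, (1,)).item()
    rotated = torch.rot90(images, k, dims=(2, 3))
    loss = nn.functional.cross_entropy(
        rot_head(backbone(rotated)),
        torch.full((images.size(0),), k, dtype=torch.long, device=device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    break                                      # one step shown; run full epochs in practice

# ---- Stage 2: supervised fine-tuning from the self-supervised weights ------
cls_head = nn.Linear(feat_dim, 10).to(device)  # 10 classes for CIFAR-10
opt = torch.optim.AdamW(list(backbone.parameters()) + list(cls_head.parameters()),
                        lr=1e-4)
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    loss = nn.functional.cross_entropy(cls_head(backbone(images)), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    break                                      # one step shown for brevity
```

The design point carried over from the abstract is that both stages see only the small target dataset: the self-supervised stage replaces large-scale pre-training as the source of the weight initialization.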
Related papers
- Pre-training of Lightweight Vision Transformers on Small Datasets with
Minimally Scaled Images [0.0]
A pure Vision Transformer (ViT) can achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling.
Experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G.
arXiv Detail & Related papers (2024-02-06T06:41:24Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive fields (a rough sketch of such a local attention mask appears after this list).
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We provide theoretical analyses showing that our method is superior to other methods in that it captures both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg (data augmentation and model regularization), model size, and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
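The ViT-P entry above constrains self-attention to multi-scale localized receptive fields. The sketch below shows one generic way such per-head locality could be expressed as additive attention masks with a different window radius per head; the function name, the Chebyshev-window formulation, and the chosen radii are illustrative assumptions, not the ViT-P authors' implementation.

```python
# Generic sketch of per-head localized attention masks (multi-scale locality).
# Not the ViT-P implementation; radii and windowing scheme are assumptions.
import torch


def local_attention_masks(grid, radii):
    """One additive mask per head, shape (heads, N, N) with N = grid * grid.
    Entries are 0 inside a head's window and -inf outside, so adding the mask
    to the attention logits before softmax restricts each patch to a local
    neighbourhood. The class token (if any) is ignored in this sketch."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)          # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)    # Chebyshev distance
    masks = []
    for r in radii:                       # one radius per head -> multi-scale receptive fields
        m = torch.zeros(dist.shape)
        m[dist > r] = float("-inf")       # block attention outside the r-window
        masks.append(m)
    return torch.stack(masks)


# Example: a 14x14 patch grid and 4 heads with growing receptive fields.
masks = local_attention_masks(14, [1, 2, 3, 7])
print(masks.shape)                        # torch.Size([4, 196, 196])
# In a transformer block these masks would be added to the per-head attention
# logits, e.g. scores = q @ k.transpose(-2, -1) / d ** 0.5 + masks
```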
This list is automatically generated from the titles and abstracts of the papers on this site.