Vision Transformer for Small-Size Datasets
- URL: http://arxiv.org/abs/2112.13492v1
- Date: Mon, 27 Dec 2021 03:24:03 GMT
- Title: Vision Transformer for Small-Size Datasets
- Authors: Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song
- Abstract summary: This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA).
SPT and LSA effectively address the lack of locality inductive bias and enable ViTs to learn from scratch even on small-size datasets.
Experimental results show that when both SPT and LSA were applied to ViTs, performance improved by an average of 2.96% on Tiny-ImageNet.
- Score: 23.855575212090365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the Vision Transformer (ViT), which applied the transformer
structure to the image classification task, has outperformed convolutional
neural networks. However, the high performance of the ViT results from
pre-training on a large-scale dataset such as JFT-300M, and this dependence on
large datasets is attributed to the ViT's low locality inductive bias. This
paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention
(LSA), which effectively address the lack of locality inductive bias and enable
ViTs to learn from scratch even on small-size datasets. Moreover, SPT and LSA
are generic and effective add-on modules that are easily applicable to various
ViTs. Experimental results show that when both SPT and LSA were applied to
ViTs, performance improved by an average of 2.96% on Tiny-ImageNet, a
representative small-size dataset. In particular, the Swin Transformer achieved
a substantial performance improvement of 4.08% thanks to the proposed SPT and
LSA.
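The two modules can be summarized in a short sketch. The PyTorch code below is a minimal, illustrative implementation based on the abstract's description: SPT concatenates spatially shifted copies of the input with the original image before patch embedding, and LSA uses a learnable softmax temperature and masks out self-token (diagonal) attention. Module names, tensor shapes, the cyclic shift, and the half-patch shift size are assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """Sketch of SPT: concatenate four diagonally shifted copies of the image
    with the original, then patchify and linearly project."""
    def __init__(self, img_channels=3, patch_size=16, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        in_dim = img_channels * 5 * patch_size * patch_size  # original + 4 shifts
        self.norm = nn.LayerNorm(in_dim)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        s = self.patch_size // 2                # assumed half-patch shift
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]
        # cyclic shift via torch.roll for brevity; the paper's shift/crop details may differ
        shifted = [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + shifted, dim=1)     # (B, 5C, H, W)
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size)
        patches = patches.transpose(1, 2)       # (B, N, 5C*P*P)
        return self.proj(self.norm(patches))    # (B, N, embed_dim)

class LocalitySelfAttention(nn.Module):
    """Sketch of LSA: learnable temperature plus diagonal (self-token) masking."""
    def __init__(self, dim, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable temperature, initialized to sqrt(head_dim)
        self.temperature = nn.Parameter(torch.tensor((dim // num_heads) ** 0.5))

    def forward(self, x):                       # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, D // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.temperature
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf'))  # suppress self-token relations
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

Both modules are drop-in: SPT replaces the usual patch-embedding layer and LSA replaces the standard multi-head self-attention inside each transformer block.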
Related papers
- UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation [12.511829774226113]
We propose an ultra-lightweight (1M) visual-inertial odometry (VIO) network capable of test-time adaptation (TTA) based on visual-inertial consistency.
It achieves a 36x smaller network size than the state of the art, with only a minor (1%) increase in error on the KITTI dataset.
arXiv Detail & Related papers (2024-09-19T22:24:14Z)
- GenFormer -- Generated Images are All You Need to Improve Robustness of Transformers on Small Datasets [11.343905946690352]
We propose GenFormer, a data augmentation strategy utilizing generated images to improve transformer accuracy and robustness on small-scale image classification tasks.
In our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new test set variants of Tiny ImageNet.
We prove the effectiveness of our approach under challenging conditions with limited training data, demonstrating significant improvements in both accuracy and robustness.
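The core idea, enlarging a small training set with generated images, can be sketched as follows. This is a generic illustration and not the GenFormer pipeline itself; the directory paths, transforms, and the assumption that synthetic images are already stored on disk as class folders are hypothetical.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical paths; in practice the synthetic images would be produced by a
# generative model before training, organized into the same class subfolders
# as the real data so that ImageFolder assigns matching labels.
REAL_DIR = "data/tiny-imagenet/train"
SYNTH_DIR = "data/tiny-imagenet-generated"

transform = transforms.Compose([
    transforms.Resize(64),
    transforms.ToTensor(),
])

real = datasets.ImageFolder(REAL_DIR, transform=transform)
synthetic = datasets.ImageFolder(SYNTH_DIR, transform=transform)

# Train the transformer on the union of real and generated images.
train_set = torch.utils.data.ConcatDataset([real, synthetic])
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```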
arXiv Detail & Related papers (2024-08-26T09:26:08Z)
- Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets [26.257612622358614]
Vision Transformers have recently attracted much attention, following the successful application of the Vision Transformer (ViT) to vision tasks.
This paper proposes to explicitly increase the input information density in the frequency domain.
Experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets.
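One plausible reading of "increasing input information density in the frequency domain" is to transform the image and keep only its low-frequency coefficients, so the tokens fed to the ViT carry more information per value. The sketch below uses a 2D FFT and a simple low-frequency crop; the paper's actual transform and selection rule may differ.

```python
import torch

def low_frequency_crop(x: torch.Tensor, keep: int = 16) -> torch.Tensor:
    """Illustrative frequency-domain compaction (not the paper's exact method).

    x: (B, C, H, W) image batch. Returns a (B, C, keep, keep) tensor of
    low-frequency magnitudes that could be patchified in place of raw pixels.
    """
    spec = torch.fft.fft2(x, norm="ortho")         # complex spectrum
    spec = torch.fft.fftshift(spec, dim=(-2, -1))  # move low frequencies to the center
    H, W = x.shape[-2:]
    cy, cx = H // 2, W // 2
    half = keep // 2
    crop = spec[..., cy - half:cy + half, cx - half:cx + half]
    return crop.abs()                              # real-valued, denser input
```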
arXiv Detail & Related papers (2022-10-25T20:24:53Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, the Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This makes it possible to train these models without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple yet effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
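As an illustration of this family of patch-relation pretext tasks, the sketch below trains a small head to predict the relative spatial position of two patch tokens. It is a generic example, not RelViT's exact task set.

```python
import torch
import torch.nn as nn

class PatchRelationHead(nn.Module):
    """Predicts the spatial relation (e.g., one of 8 neighbour directions)
    between two patch tokens produced by a ViT encoder. Illustrative only."""
    def __init__(self, dim: int, num_relations: int = 8):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, num_relations)
        )

    def forward(self, tokens: torch.Tensor, idx_a: torch.Tensor, idx_b: torch.Tensor):
        # tokens: (B, N, D) patch tokens; idx_a / idx_b: (B,) patch indices
        batch = torch.arange(tokens.size(0), device=tokens.device)
        a = tokens[batch, idx_a]                           # (B, D)
        b = tokens[batch, idx_b]                           # (B, D)
        return self.classifier(torch.cat([a, b], dim=-1))  # (B, num_relations)

# Training would add a cross-entropy loss between these logits and the true
# relation label, alongside (or before) the downstream objective.
```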
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Towards Data-Efficient Detection Transformers [77.43470797296906]
We show that most detection transformers suffer significant performance drops on small-size datasets.
We empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR.
We introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency.
arXiv Detail & Related papers (2022-03-17T17:56:34Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of the vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
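A minimal sketch of token reduction between transformer blocks is shown below: a learned scoring head keeps only the top-scoring fraction of tokens before the next block. This illustrates the general idea of processing fewer tokens as inference proceeds; AdaViT's actual halting and selection mechanism is more involved.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Keeps the top `keep_ratio` fraction of tokens according to a learned
    per-token score. A generic sketch, not AdaViT's exact mechanism."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) tokens (class-token handling omitted for brevity)
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(x).squeeze(-1)                        # (B, N)
        top = scores.topk(k, dim=1).indices                       # (B, k)
        return x.gather(1, top.unsqueeze(-1).expand(-1, -1, D))   # (B, k, D)

# Inserting a TokenPruner between successive transformer blocks shrinks the
# sequence length, and hence the attention cost, as inference proceeds.
```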
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)