Efficient Training of Visual Transformers with Small-Size Datasets
- URL: http://arxiv.org/abs/2106.03746v1
- Date: Mon, 7 Jun 2021 16:14:06 GMT
- Title: Efficient Training of Visual Transformers with Small-Size Datasets
- Authors: Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri and Marco De Nadai
- Abstract summary: Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
- Score: 64.60765211331697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Transformers (VTs) are emerging as an architectural paradigm
alternative to Convolutional networks (CNNs). Unlike CNNs, VTs can capture
global relations between image elements, and they potentially have a larger
representation capacity. However, the lack of the typical convolutional
inductive bias makes these models more data-hungry than common CNNs. In fact,
some local properties of the visual domain that are embedded in the CNN
architectural design must, in VTs, be learned from samples. In this paper, we
empirically analyse different VTs, comparing their robustness in a small
training-set regime, and we show that, despite having comparable accuracy when
trained on ImageNet, their performance on smaller datasets can differ greatly.
Moreover, we propose a self-supervised task which can extract additional
information from images with only a negligible computational overhead. This
task encourages the VTs to learn spatial relations within an image and makes
VT training much more robust when training data are scarce. Our task is used
jointly with the standard (supervised) training, and it does not depend on
specific architectural choices, so it can easily be plugged into existing VTs.
Using an extensive evaluation with different VTs and datasets, we show that
our method can improve (sometimes dramatically) the final accuracy of the VTs.
The code will be available upon acceptance.
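The abstract describes the method only at a high level: an auxiliary self-supervised task, trained jointly with the supervised objective, that pushes the VT to learn spatial relations between image regions. As a purely illustrative sketch (not the authors' released code), the snippet below assumes the task takes the form of predicting the relative 2D offset between randomly sampled pairs of patch-token embeddings; the names `RelativeLocalizationHead`, `training_loss` and the weight `lambda_aux` are hypothetical.

```python
# Illustrative sketch only: one plausible auxiliary "spatial relation" loss for a
# Visual Transformer, added to the usual supervised cross-entropy. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeLocalizationHead(nn.Module):
    """Predicts the normalized (dx, dy) offset between two patch-token embeddings."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, tokens: torch.Tensor, grid_size: int, num_pairs: int = 64):
        # tokens: (B, N, D) patch-token embeddings, with N == grid_size * grid_size
        B, N, D = tokens.shape
        i = torch.randint(0, N, (B, num_pairs), device=tokens.device)
        j = torch.randint(0, N, (B, num_pairs), device=tokens.device)
        t_i = torch.gather(tokens, 1, i.unsqueeze(-1).expand(-1, -1, D))
        t_j = torch.gather(tokens, 1, j.unsqueeze(-1).expand(-1, -1, D))
        pred = self.mlp(torch.cat([t_i, t_j], dim=-1))  # (B, num_pairs, 2)
        # Ground-truth offsets come for free from the tokens' grid coordinates,
        # so the auxiliary task needs no extra annotations.
        target = torch.stack(
            [(i % grid_size) - (j % grid_size), (i // grid_size) - (j // grid_size)],
            dim=-1,
        ).float() / grid_size
        return F.l1_loss(pred, target)

def training_loss(logits, labels, patch_tokens, aux_head, grid_size, lambda_aux=0.1):
    """Standard supervised cross-entropy plus the auxiliary spatial-relation term."""
    return F.cross_entropy(logits, labels) + lambda_aux * aux_head(patch_tokens, grid_size)
```

Because the ground-truth offsets are derived from the token grid itself, the auxiliary term adds only a small MLP's worth of computation; the pair-sampling strategy and the weight `lambda_aux` are placeholders that would need tuning per backbone and dataset.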
Related papers
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Masked autoencoders are effective solution to transformer data-hungry [0.0]
Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) in several vision tasks thanks to their global modeling capabilities.
However, ViTs lack the inductive bias inherent to convolutions, which makes them require large amounts of training data.
Masked autoencoders (MAE) can make the transformer focus more on the image itself.
arXiv Detail & Related papers (2022-12-12T03:15:19Z)
- How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification [0.7614628596146599]
Vision Transformers (VTs) are becoming a valuable alternative to Convolutional Neural Networks (CNNs).
We study whether VTs that are pre-trained on the popular ImageNet dataset learn representations that are transferable to the non-natural image domain.
Our results show that VTs exhibit strong generalization properties and that these networks are more powerful feature extractors than CNNs.
arXiv Detail & Related papers (2022-08-09T12:05:18Z)
- Locality Guidance for Improving Vision Transformers on Tiny Datasets [17.352384588114838]
The Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, but pure VT models perform poorly on tiny datasets.
This paper proposes locality guidance to improve the performance of VTs on tiny datasets.
arXiv Detail & Related papers (2022-07-20T16:41:41Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses showing that our method is superior to other methods in that it captures both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z)
- A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z)
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)