Efficient Training of Visual Transformers with Small-Size Datasets
- URL: http://arxiv.org/abs/2106.03746v1
- Date: Mon, 7 Jun 2021 16:14:06 GMT
- Title: Efficient Training of Visual Transformers with Small-Size Datasets
- Authors: Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri and Marco De Nadai
- Abstract summary: Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
- Score: 64.60765211331697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Transformers (VTs) are emerging as an architectural paradigm
alternative to Convolutional networks (CNNs). Unlike CNNs, VTs can capture
global relations between image elements, and they potentially have a larger
representation capacity. However, the lack of the typical convolutional
inductive bias makes these models more data-hungry than common CNNs. In fact,
some local properties of the visual domain that are embedded in the CNN
architectural design must, in VTs, be learned from samples. In this paper, we
empirically analyse different VTs, comparing their robustness in a small
training-set regime, and we show that, despite having comparable accuracy when
trained on ImageNet, their performance on smaller datasets can differ greatly.
Moreover, we propose a self-supervised task which can extract additional
information from images with only a negligible computational overhead. This
task encourages the VTs to learn spatial relations within an image and makes
VT training much more robust when training data are scarce. Our task is used
jointly with the standard (supervised) training, and it does not depend on
specific architectural choices, so it can easily be plugged into existing VTs.
Using an extensive evaluation with different VTs and datasets, we show that
our method can improve (sometimes dramatically) the final accuracy of the VTs.
The code will be available upon acceptance.
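The abstract describes the method only at a high level: an auxiliary self-supervised task, trained jointly with the supervised objective, that pushes the VT to learn spatial relations between image regions. As a purely illustrative sketch (not the authors' released code), the snippet below assumes the task takes the form of predicting the relative 2D offset between randomly sampled pairs of patch-token embeddings; the names `RelativeLocalizationHead`, `training_loss` and the weight `lambda_aux` are hypothetical.

```python
# Illustrative sketch only: one plausible auxiliary "spatial relation" loss for a
# Visual Transformer, added to the usual supervised cross-entropy. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeLocalizationHead(nn.Module):
    """Predicts the normalized (dx, dy) offset between two patch-token embeddings."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, tokens: torch.Tensor, grid_size: int, num_pairs: int = 64):
        # tokens: (B, N, D) patch-token embeddings, with N == grid_size * grid_size
        B, N, D = tokens.shape
        i = torch.randint(0, N, (B, num_pairs), device=tokens.device)
        j = torch.randint(0, N, (B, num_pairs), device=tokens.device)
        t_i = torch.gather(tokens, 1, i.unsqueeze(-1).expand(-1, -1, D))
        t_j = torch.gather(tokens, 1, j.unsqueeze(-1).expand(-1, -1, D))
        pred = self.mlp(torch.cat([t_i, t_j], dim=-1))  # (B, num_pairs, 2)
        # Ground-truth offsets come for free from the tokens' grid coordinates,
        # so the auxiliary task needs no extra annotations.
        target = torch.stack(
            [(i % grid_size) - (j % grid_size), (i // grid_size) - (j // grid_size)],
            dim=-1,
        ).float() / grid_size
        return F.l1_loss(pred, target)

def training_loss(logits, labels, patch_tokens, aux_head, grid_size, lambda_aux=0.1):
    """Standard supervised cross-entropy plus the auxiliary spatial-relation term."""
    return F.cross_entropy(logits, labels) + lambda_aux * aux_head(patch_tokens, grid_size)
```

Because the ground-truth offsets are derived from the token grid itself, the auxiliary term adds only a small MLP's worth of computation; the pair-sampling strategy and the weight `lambda_aux` are placeholders that would need tuning per backbone and dataset.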
Related papers
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Masked autoencoders are effective solution to transformer data-hungry [0.0]
Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) in several vision tasks thanks to their global modeling capabilities.
However, ViTs lack the inductive bias inherent to convolutions, which makes them require large amounts of training data.
Masked autoencoders (MAE) can make the transformer focus more on the image itself.
arXiv Detail & Related papers (2022-12-12T03:15:19Z)
- How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification [0.7614628596146599]
Vision Transformers (VTs) are becoming a valuable alternative to Convolutional Neural Networks (CNNs).
We study whether VTs that are pre-trained on the popular ImageNet dataset learn representations that are transferable to the non-natural image domain.
Our results show that VTs exhibit strong generalization properties and that these networks are more powerful feature extractors than CNNs.
arXiv Detail & Related papers (2022-08-09T12:05:18Z)
- Locality Guidance for Improving Vision Transformers on Tiny Datasets [17.352384588114838]
The Vision Transformer (VT) architecture is becoming increasingly popular in computer vision, but pure VT models perform poorly on tiny datasets.
This paper proposes locality guidance to improve the performance of VTs on tiny datasets.
arXiv Detail & Related papers (2022-07-20T16:41:41Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Training Vision Transformers with Only 2040 Images [35.86457465241119]
Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition.
We give theoretical analyses showing that our method is superior to other methods in that it captures both feature alignment and instance similarities.
We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones.
arXiv Detail & Related papers (2022-01-26T03:22:08Z)
- A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z)
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)