Locality Guidance for Improving Vision Transformers on Tiny Datasets
- URL: http://arxiv.org/abs/2207.10026v1
- Date: Wed, 20 Jul 2022 16:41:41 GMT
- Title: Locality Guidance for Improving Vision Transformers on Tiny Datasets
- Authors: Kehan Li, Runyi Yu, Zhennan Wang, Li Yuan, Guoli Song, Jie Chen
- Abstract summary: The Vision Transformer (VT) architecture is becoming trendy in computer vision, but pure VT models perform poorly on tiny datasets.
This paper proposes the locality guidance for improving the performance of VTs on tiny datasets.
- Score: 17.352384588114838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the Vision Transformer (VT) architecture is becoming trendy in computer
vision, pure VT models perform poorly on tiny datasets. To address this issue,
this paper proposes the locality guidance for improving the performance of VTs
on tiny datasets. We first analyze that local information, which is of
great importance for understanding images, is hard to learn with limited data
due to the high flexibility and intrinsic globality of the self-attention
mechanism in VTs. To facilitate the learning of local information, we realize
locality guidance for VTs by imitating the features of an already trained
convolutional neural network (CNN), inspired by the built-in local-to-global
hierarchy of CNNs. Under our dual-task learning paradigm, the locality
guidance provided by a
lightweight CNN trained on low-resolution images is adequate to accelerate the
convergence and improve the performance of VTs to a large extent. Therefore,
our locality guidance approach is very simple and efficient, and can serve as a
basic performance enhancement method for VTs on tiny datasets. Extensive
experiments demonstrate that our method can significantly improve VTs when
training from scratch on tiny datasets and is compatible with different kinds
of VTs and datasets. For example, our proposed method can boost the performance
of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T and 7.85%
for PVT), and lift even the stronger PVTv2 baseline by 1.86% to 79.30%, showing
the potential of VTs on tiny datasets. The code is available at
https://github.com/lkhl/tiny-transformers.
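To make the dual-task paradigm concrete, the sketch below shows one training step that pairs the usual classification loss with a feature-imitation term computed against a frozen, already trained lightweight CNN. It is a minimal PyTorch-style reading of the abstract, not the authors' released implementation (see the repository above); the names vt, cnn_teacher, proj and the weight alpha are illustrative assumptions.

    # Minimal sketch of locality guidance as a dual-task training step.
    # Assumptions (not from the released code): `vt` returns (logits, patch tokens),
    # `cnn_teacher` is a frozen, pre-trained lightweight CNN, and `proj` is a
    # 1x1 convolution (torch.nn.Conv2d(vt_dim, cnn_dim, 1)) aligning channel widths.
    import torch
    import torch.nn.functional as F

    def locality_guidance_step(vt, cnn_teacher, proj, images, labels, alpha=1.0):
        # Task 1: standard classification with the Vision Transformer.
        logits, vt_tokens = vt(images)                 # vt_tokens: (B, N, C) patch tokens
        cls_loss = F.cross_entropy(logits, labels)

        # Task 2: imitate the local features of the frozen CNN teacher.
        with torch.no_grad():
            cnn_feats = cnn_teacher(images)            # (B, C', H', W') feature map

        B, N, C = vt_tokens.shape
        H = W = int(N ** 0.5)                          # assume a square patch grid, no CLS token
        vt_map = vt_tokens.transpose(1, 2).reshape(B, C, H, W)
        vt_map = proj(vt_map)                          # align channel dimensions
        cnn_feats = F.adaptive_avg_pool2d(cnn_feats, (H, W))  # align spatial size

        guide_loss = F.mse_loss(vt_map, cnn_feats)
        return cls_loss + alpha * guide_loss

Which layers are matched, the exact distance function, and the guidance weight alpha should follow the paper and the linked repository; the point of the sketch is only that the guidance adds a single extra regression term on top of normal supervised training.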
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Optimizing Vision Transformers with Data-Free Knowledge Transfer [8.323741354066474]
Vision transformers (ViTs) have excelled in various computer vision tasks due to their superior ability to capture long-distance dependencies.
We propose compressing large ViT models using Knowledge Distillation (KD), which is implemented data-free to circumvent limitations related to data availability.
arXiv Detail & Related papers (2024-08-12T07:03:35Z)
- Scattering Vision Transformer: Spectral Mixing Matters [3.0665715162712837]
We present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges.
SVT incorporates a spectrally scattering network that enables the capture of intricate image details.
SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs.
arXiv Detail & Related papers (2023-11-02T15:24:23Z)
- Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation [13.056764072568749]
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) on smaller datasets.
We propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs.
Our proposed module is memory- and compute-efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens.
arXiv Detail & Related papers (2023-05-15T11:23:18Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks.
To mitigate their heavy storage and computational demands, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows these models to be trained without large-scale pre-training or changes to the model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.