Bootstrapping ViTs: Towards Liberating Vision Transformers from
Pre-training
- URL: http://arxiv.org/abs/2112.03552v2
- Date: Thu, 9 Dec 2021 16:28:53 GMT
- Title: Bootstrapping ViTs: Towards Liberating Vision Transformers from
Pre-training
- Authors: Haofei Zhang, Jiarui Duan, Mengqi Xue, Jie Song, Li Sun, Mingli Song
- Abstract summary: Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back to ViTs while preserving their network architectures for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
- Score: 29.20567759071523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, vision Transformers (ViTs) have been developing rapidly and are
starting to challenge the dominance of convolutional neural networks (CNNs) in the
realm of computer vision (CV). By replacing the hard-coded inductive biases of
convolution with the general-purpose Transformer architecture, ViTs have surpassed
CNNs, especially in data-sufficient circumstances. However, ViTs are prone to
over-fitting on small datasets and thus rely on large-scale pre-training, which
consumes enormous amounts of time. In this paper, we strive to liberate ViTs from
pre-training by introducing CNNs' inductive biases back to ViTs while preserving
their network architectures for a higher performance upper bound, and by setting up
more suitable optimization objectives. To begin with, an agent CNN is designed
based on the given ViT with inductive biases. Then a bootstrapping training
algorithm is proposed to jointly optimize the agent and the ViT with weight
sharing, during which the ViT learns inductive biases from the intermediate
features of the agent. Extensive experiments on CIFAR-10/100 and ImageNet-1k
with limited training data show encouraging results: the inductive biases help
ViTs converge significantly faster and outperform conventional CNNs with even
fewer parameters.
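The bootstrapping recipe described in the abstract can be pictured with a short training-loop sketch. This is a minimal illustration under assumptions, not the authors' released implementation: the `return_intermediate` flag, the `lambda_feat` weight, and the choice of which intermediate blocks are matched are hypothetical, and the weight sharing between the agent CNN and the ViT is only indicated in a comment.

```python
import torch
import torch.nn.functional as F

def bootstrap_step(vit, agent_cnn, images, labels, optimizer, lambda_feat=1.0):
    """One joint optimization step: the agent CNN and the ViT are trained
    together, and the ViT is additionally pulled toward the agent's
    intermediate features (a hypothetical rendering of the paper's idea)."""
    optimizer.zero_grad()  # optimizer is assumed to hold both models' parameters

    # Both networks classify the same batch; parts of their weights are
    # assumed to be shared/tied, as described in the abstract.
    vit_logits, vit_feats = vit(images, return_intermediate=True)
    cnn_logits, cnn_feats = agent_cnn(images, return_intermediate=True)

    # Supervised task losses for both branches.
    loss = F.cross_entropy(vit_logits, labels) + F.cross_entropy(cnn_logits, labels)

    # The ViT learns the agent's inductive biases by matching intermediate
    # feature maps; the agent side is detached so it acts as a teacher.
    # (In practice a projection may be needed to align feature shapes.)
    for f_vit, f_cnn in zip(vit_feats, cnn_feats):
        loss = loss + lambda_feat * F.mse_loss(f_vit, f_cnn.detach())

    loss.backward()
    optimizer.step()
    return loss.item()
```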
Related papers
- Structured Initialization for Attention in Vision Transformers [34.374054040300805]
Convolutional neural networks (CNNs) have an architectural inductive bias that enables them to perform well on small-scale problems.
We argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT.
This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications.
arXiv Detail & Related papers (2024-04-01T14:34:47Z) - Convolutional Embedding Makes Hierarchical Vision Transformer Stronger [16.72943631060293]
Vision Transformers (ViTs) have recently dominated a range of computer vision tasks, yet they suffer from low training-data efficiency and inferior local semantic representation capability without an appropriate inductive bias.
CNNs inherently capture region-aware semantics, inspiring researchers to introduce CNNs back into the architecture of ViTs to provide the desired inductive bias.
In this paper, we explore how the macro architecture of hybrid CNNs/ViTs enhances the performance of hierarchical ViTs.
arXiv Detail & Related papers (2022-07-27T06:36:36Z) - Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple DG approach for ViTs, coined as self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z) - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision
Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing a multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of the ViT to have a multi-scale localized receptive field (a hedged sketch of such a locality bias appears after this list).
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Scaled ReLU Matters for Training Vision Transformers [45.41439457701873]
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs).
However, the training of ViTs is much harder than that of CNNs, as it is sensitive to training parameters such as the learning rate and warmup schedule.
We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stabilization, but also increases the diversity of patch tokens.
arXiv Detail & Related papers (2021-09-08T17:57:58Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
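For the ViT-P entry above, constraining self-attention to multi-scale localized receptive fields can be sketched as an additive attention bias that blocks patch pairs outside a per-head window. The helper name `local_attention_bias`, the window sizes, and the mask construction are illustrative assumptions, not the paper's actual implementation; the class token is omitted for brevity.

```python
import torch

def local_attention_bias(grid_size: int, window: int) -> torch.Tensor:
    """Additive bias for one head: 0 inside a (2*window+1)-wide neighborhood on
    the patch grid, -inf outside, so softmax attention becomes localized.
    (Illustrative construction; per-head window sizes are assumptions.)"""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float()                                  # (N, 2) patch coordinates
    dist = (coords[:, None, :] - coords[None, :, :]).abs().max(-1).values   # Chebyshev distance
    bias = torch.zeros_like(dist)
    bias[dist > window] = float("-inf")
    return bias  # shape (N, N), added to attention logits before softmax

# Multi-scale: give each head a different locality window (hypothetical values).
grid = 14                        # 14x14 patches for a 224x224 image with 16x16 patches
windows = [1, 2, 3, 5, 7, 14]    # the last head is effectively global
biases = torch.stack([local_attention_bias(grid, w) for w in windows])  # (heads, N, N)
```

Each head's bias would be added to its attention logits before the softmax, so different heads attend at different spatial scales.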
This list is automatically generated from the titles and abstracts of the papers in this site.