Auto-scaling Vision Transformers without Training
- URL: http://arxiv.org/abs/2202.11921v2
- Date: Sun, 27 Feb 2022 21:38:54 GMT
- Title: Auto-scaling Vision Transformers without Training
- Authors: Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang,
Denny Zhou
- Abstract summary: We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
- Score: 84.34662535276898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work targets the automated design and scaling of Vision Transformers
(ViTs). The motivation comes from two pain points: 1) the lack of efficient and
principled methods for designing and scaling ViTs; 2) the tremendous
computational cost of training ViTs, which is much heavier than that of their
convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling
framework for ViTs without training, which automatically discovers and scales
up ViTs in an efficient and principled manner. Specifically, we first design a
"seed" ViT topology by leveraging a training-free search process. This
extremely fast search is enabled by a comprehensive study of ViT's network
complexity, yielding a strong Kendall-tau correlation with ground-truth
accuracies. Second, starting from the "seed" topology, we automate the scaling
rule for ViTs by growing the widths/depths of different ViT layers. This results in
a series of architectures with different numbers of parameters in a single run.
Finally, based on the observation that ViTs can tolerate coarse tokenization in
early training stages, we propose a progressive tokenization strategy to train
ViTs faster and cheaper. As a unified framework, As-ViT achieves strong
performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7%
mAP on COCO) without any manual crafting or scaling of ViT architectures: the
end-to-end model design and scaling process cost only 12 hours on one V100 GPU.
Our code is available at https://github.com/VITA-Group/AsViT.
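A minimal sketch of the training-free ranking idea from the abstract: score untrained candidate topologies with a cheap complexity proxy and check how well the proxy ranks them against ground-truth accuracies via Kendall-tau. The gradient-norm proxy, the toy candidate models, and the placeholder accuracies below are illustrative assumptions, not As-ViT's actual complexity measure or search space.

```python
# Sketch: rank untrained candidates with a cheap proxy, then measure how well
# the proxy ordering agrees with true accuracies (Kendall-tau).
import torch
import torch.nn as nn
from scipy.stats import kendalltau


def proxy_score(model: nn.Module, batch: torch.Tensor) -> float:
    """Hypothetical training-free proxy: sum of gradient magnitudes at init."""
    model.zero_grad()
    out = model(batch)
    out.sum().backward()
    return sum(p.grad.abs().sum().item()
               for p in model.parameters() if p.grad is not None)


def rank_candidates(candidates, batch, true_accuracies):
    """Score each untrained candidate and report Kendall-tau vs. accuracy."""
    scores = [proxy_score(m, batch) for m in candidates]
    tau, _ = kendalltau(scores, true_accuracies)
    return scores, tau


if __name__ == "__main__":
    # Toy "topologies": small MLPs of different widths stand in for ViT variants.
    batch = torch.randn(8, 3 * 32 * 32)
    candidates = [
        nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, w),
                      nn.GELU(), nn.Linear(w, 10))
        for w in (64, 128, 256)
    ]
    fake_accuracies = [0.62, 0.68, 0.71]  # placeholder ground-truth accuracies
    scores, tau = rank_candidates(candidates, batch, fake_accuracies)
    print("proxy scores:", scores, "Kendall-tau:", tau)
```

In the paper's setting the same ranking loop would be run over sampled ViT topologies; a proxy is useful exactly when this Kendall-tau is high, since the search then never needs to train any candidate.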
Related papers
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Training-free Transformer Architecture Search [89.88412583106741]
Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks.
Current Transformer Architecture Search (TAS) methods are time-consuming, and existing zero-cost proxies for CNNs do not generalize well to the ViT search space.
In this paper, for the first time, we investigate how to conduct TAS in a training-free manner and devise an effective training-free TAS scheme.
arXiv Detail & Related papers (2022-03-23T06:06:54Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework that jointly learns model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones, on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation [4.3012765978447565]
The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks.
One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-12-16T23:56:04Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, which regenerates the attention maps to increase their diversity (a hedged sketch of this head-mixing idea appears after this list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
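Below is a minimal sketch of the Re-attention idea referenced in the DeepViT entry, assuming one plausible form of the mechanism: mix the per-head softmax attention maps with a learnable head-to-head matrix before applying them to the values. The layer sizes, the identity initialization of the mixing matrix, and the omission of dropout/normalization details are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Multi-head attention whose per-head maps are remixed across heads."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-to-head mixing matrix (the "re-attention" step);
        # identity init means it starts as plain multi-head attention.
        self.head_mix = nn.Parameter(torch.eye(num_heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Mix attention maps across heads to diversify them in deep blocks.
        attn = torch.einsum("hg,bgnm->bhnm", self.head_mix, attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: tokens = torch.randn(2, 197, 384); ReAttention(384, 8)(tokens).shape
```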