Auto-scaling Vision Transformers without Training
- URL: http://arxiv.org/abs/2202.11921v2
- Date: Sun, 27 Feb 2022 21:38:54 GMT
- Title: Auto-scaling Vision Transformers without Training
- Authors: Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang,
Denny Zhou
- Abstract summary: We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
- Score: 84.34662535276898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work targets the automated design and scaling of Vision Transformers
(ViTs). The motivation comes from two pain points: 1) the lack of efficient and
principled methods for designing and scaling ViTs; 2) the tremendous
computational cost of training ViTs, which is much heavier than that of their
convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling
framework for ViTs without training, which automatically discovers and scales
up ViTs in an efficient and principled manner. Specifically, we first design a
"seed" ViT topology by leveraging a training-free search process. This
extremely fast search is enabled by a comprehensive study of ViT's network
complexity, yielding a strong Kendall-tau correlation with ground-truth
accuracies. Second, starting from the "seed" topology, we automate the scaling
rule for ViTs by growing the widths/depths of different ViT layers. This results in
a series of architectures with different numbers of parameters in a single run.
Finally, based on the observation that ViTs can tolerate coarse tokenization in
early training stages, we propose a progressive tokenization strategy to train
ViTs faster and cheaper. As a unified framework, As-ViT achieves strong
performance on classification (83.5% top-1 on ImageNet-1k) and detection (52.7%
mAP on COCO) without any manual crafting or scaling of ViT architectures: the
end-to-end model design and scaling process cost only 12 hours on one V100 GPU.
Our code is available at https://github.com/VITA-Group/AsViT.
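A minimal sketch of the training-free ranking idea from the abstract: score untrained candidate topologies with a cheap complexity proxy and check how well the proxy ranks them against ground-truth accuracies via Kendall-tau. The gradient-norm proxy, the toy candidate models, and the placeholder accuracies below are illustrative assumptions, not As-ViT's actual complexity measure or search space.

```python
# Sketch: rank untrained candidates with a cheap proxy, then measure how well
# the proxy ordering agrees with true accuracies (Kendall-tau).
import torch
import torch.nn as nn
from scipy.stats import kendalltau


def proxy_score(model: nn.Module, batch: torch.Tensor) -> float:
    """Hypothetical training-free proxy: sum of gradient magnitudes at init."""
    model.zero_grad()
    out = model(batch)
    out.sum().backward()
    return sum(p.grad.abs().sum().item()
               for p in model.parameters() if p.grad is not None)


def rank_candidates(candidates, batch, true_accuracies):
    """Score each untrained candidate and report Kendall-tau vs. accuracy."""
    scores = [proxy_score(m, batch) for m in candidates]
    tau, _ = kendalltau(scores, true_accuracies)
    return scores, tau


if __name__ == "__main__":
    # Toy "topologies": small MLPs of different widths stand in for ViT variants.
    batch = torch.randn(8, 3 * 32 * 32)
    candidates = [
        nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, w),
                      nn.GELU(), nn.Linear(w, 10))
        for w in (64, 128, 256)
    ]
    fake_accuracies = [0.62, 0.68, 0.71]  # placeholder ground-truth accuracies
    scores, tau = rank_candidates(candidates, batch, fake_accuracies)
    print("proxy scores:", scores, "Kendall-tau:", tau)
```

In the paper's setting the same ranking loop would be run over sampled ViT topologies; a proxy is useful exactly when this Kendall-tau is high, since the search then never needs to train any candidate.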
Related papers
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for plain ViTs.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Training-free Transformer Architecture Search [89.88412583106741]
Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks.
Current Transformer Architecture Search (TAS) methods are time-consuming, and existing zero-cost proxies for CNNs do not generalize well to the ViT search space.
In this paper, for the first time, we investigate how to conduct TAS in a training-free manner and devise an effective training-free TAS scheme.
arXiv Detail & Related papers (2022-03-23T06:06:54Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework that jointly learns model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones, on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation [4.3012765978447565]
The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks.
One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-12-16T23:56:04Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, which regenerates the attention maps to increase their diversity (a hedged sketch of this head-mixing idea appears after this list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
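Below is a minimal sketch of the Re-attention idea referenced in the DeepViT entry, assuming one plausible form of the mechanism: mix the per-head softmax attention maps with a learnable head-to-head matrix before applying them to the values. The layer sizes, the identity initialization of the mixing matrix, and the omission of dropout/normalization details are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Multi-head attention whose per-head maps are remixed across heads."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-to-head mixing matrix (the "re-attention" step);
        # identity init means it starts as plain multi-head attention.
        self.head_mix = nn.Parameter(torch.eye(num_heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Mix attention maps across heads to diversify them in deep blocks.
        attn = torch.einsum("hg,bgnm->bhnm", self.head_mix, attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: tokens = torch.randn(2, 197, 384); ReAttention(384, 8)(tokens).shape
```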