Training-free Transformer Architecture Search
- URL: http://arxiv.org/abs/2203.12217v1
- Date: Wed, 23 Mar 2022 06:06:54 GMT
- Title: Training-free Transformer Architecture Search
- Authors: Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian,
Jie Chen, Rongrong Ji
- Abstract summary: Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks.
Current Transformer Architecture Search (TAS) methods are time-consuming, and existing zero-cost proxies designed for CNNs do not generalize well to the ViT search space.
In this paper, for the first time, we investigate how to conduct TAS in a training-free manner and devise an effective training-free TAS scheme.
- Score: 89.88412583106741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Transformer (ViT) has achieved remarkable success in several
computer vision tasks. This progress is closely tied to architecture design,
so it is worthwhile to propose Transformer Architecture Search (TAS)
to search for better ViTs automatically. However, current TAS methods are
time-consuming, and existing zero-cost proxies designed for CNNs do not generalize well to
the ViT search space according to our experimental observations. In this paper,
for the first time, we investigate how to conduct TAS in a training-free manner
and devise an effective training-free TAS (TF-TAS) scheme. Firstly, we observe
that the properties of multi-head self-attention (MSA) and multi-layer
perceptron (MLP) in ViTs are quite different and that the synaptic diversity of
MSA affects the performance notably. Secondly, based on the observation, we
devise a modular strategy in TF-TAS that evaluates and ranks ViT architectures
from two theoretical perspectives: synaptic diversity and synaptic saliency,
termed the DSS-indicator. With the DSS-indicator, evaluation results are strongly
correlated with the test accuracies of ViT models. Experimental results
demonstrate that our TF-TAS achieves competitive performance against
state-of-the-art manually or automatically designed ViT architectures, and it
greatly improves search efficiency in the ViT search space: from about $24$
GPU days to less than $0.5$ GPU days. Moreover, the proposed DSS-indicator
outperforms the existing cutting-edge zero-cost approaches (e.g., TE-score and
NASWOT).
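The DSS-indicator scores an untrained ViT by combining synaptic diversity over the multi-head self-attention (MSA) weights with synaptic saliency over the multi-layer perceptron (MLP) weights. Below is a minimal sketch of how such a DSS-style proxy could be computed in PyTorch; the module-name matching ("attn"/"mlp"), the summed-logits loss, and the exact per-module norms are illustrative assumptions rather than the authors' implementation.
```python
# Sketch of a DSS-style training-free proxy for ranking ViT candidates.
# Assumptions: a PyTorch ViT whose MSA projection parameters contain "attn"
# in their names and whose MLP weights contain "mlp"; real architectures and
# the paper's exact scoring rule may differ.
import torch
import torch.nn as nn

def dss_score(vit: nn.Module, images: torch.Tensor) -> float:
    """Score a randomly initialized ViT with one forward/backward pass."""
    vit.zero_grad()
    vit(images).sum().backward()  # dummy objective, only needed to obtain gradients

    score = 0.0
    for name, param in vit.named_parameters():
        if param.grad is None or param.dim() != 2:
            continue  # skip biases, norms, and parameters without gradients
        contrib = (param.grad * param).detach()  # elementwise grad-weight product
        if "attn" in name:
            # MSA: synaptic diversity, approximated here by the nuclear norm,
            # which rewards weight matrices with a diverse singular spectrum.
            score += torch.linalg.matrix_norm(contrib, ord="nuc").item()
        elif "mlp" in name:
            # MLP: synaptic saliency, a SynFlow/SNIP-style magnitude sum.
            score += contrib.abs().sum().item()
    return score
```
In a TF-TAS-style search, each sampled architecture would be scored on a single mini-batch and only the top-ranked candidates kept, which is how the reported reduction from about 24 GPU days to under 0.5 GPU days becomes possible.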
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy on computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- TRT-ViT: TensorRT-oriented Vision Transformer [19.173764508139016]
A family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT.
Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers.
arXiv Detail & Related papers (2022-05-19T14:20:25Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision Transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ higher than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)