Vision Transformer Architecture Search
- URL: http://arxiv.org/abs/2106.13700v1
- Date: Fri, 25 Jun 2021 15:39:08 GMT
- Title: Vision Transformer Architecture Search
- Authors: Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian,
Changshui Zhang, Xiaogang Wang, Chang Xu
- Abstract summary: Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior to the current baseline ViT architecture.
- Score: 64.73920718915282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, transformers have shown great superiority in solving computer
vision tasks by modeling images as a sequence of manually-split patches with a
self-attention mechanism. However, current architectures of vision transformers
(ViTs) are simply inherited from natural language processing (NLP) tasks and
have not been sufficiently investigated and optimized. In this paper, we make a
further step by examining the intrinsic structure of transformers for vision
tasks and propose an architecture search method, dubbed ViTAS, to search for
the optimal architecture with similar hardware budgets. Concretely, we design a
new effective yet efficient weight sharing paradigm for ViTs, such that
architectures with different token embedding, sequence size, number of heads,
width, and depth can be derived from a single super-transformer. Moreover, to
cater for the variance of distinct architectures, we introduce \textit{private}
class token and self-attention maps in the super-transformer. In addition, to
adapt the searching for different budgets, we propose to search the sampling
probability of identity operation. Experimental results show that our ViTAS
attains excellent results compared to existing pure transformer architectures.
For example, with $1.3$G FLOPs budget, our searched architecture achieves
$74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior to the current
baseline ViT architecture. Code is available at
\url{https://github.com/xiusu/ViTAS}.
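To make the weight-sharing paradigm above more concrete, the following is a minimal PyTorch sketch of the idea: a single super-transformer holds the largest weights, candidate widths, hidden sizes, and depths are taken as slices of those weights, each candidate embedding width keeps its own private class token, and a per-layer probability of sampling the identity operation controls the expected depth. All class names, dimensions, and the learnable identity_logits parameter are illustrative assumptions, not the authors' implementation (see the repository linked above for the real code); attention is omitted for brevity.

```python
# Minimal sketch of a weight-sharing super-transformer (not the ViTAS code).
# Sub-architectures with different widths, hidden sizes, and depths reuse
# slices of one set of weights; class tokens are kept "private" per width.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperLinear(nn.Linear):
    """Linear layer whose input/output widths can be sliced at run time."""

    def forward(self, x, in_dim=None, out_dim=None):
        in_dim = in_dim or self.in_features
        out_dim = out_dim or self.out_features
        weight = self.weight[:out_dim, :in_dim]
        bias = self.bias[:out_dim] if self.bias is not None else None
        return F.linear(x, weight, bias)


class SuperBlock(nn.Module):
    """One shared block; attention is omitted to keep the sketch short."""

    def __init__(self, max_dim=384, max_hidden=1536):
        super().__init__()
        self.norm = nn.LayerNorm(max_dim)
        self.fc1 = SuperLinear(max_dim, max_hidden)
        self.fc2 = SuperLinear(max_hidden, max_dim)

    def forward(self, x, dim, hidden, use_identity=False):
        if use_identity:  # identity operation: skip this block entirely
            return x
        h = F.layer_norm(x, (dim,), self.norm.weight[:dim], self.norm.bias[:dim])
        h = torch.relu(self.fc1(h, in_dim=dim, out_dim=hidden))
        h = self.fc2(h, in_dim=hidden, out_dim=dim)
        return x + h


class SuperViT(nn.Module):
    def __init__(self, dims=(192, 256, 384), depth=12, num_classes=1000):
        super().__init__()
        self.dims = dims
        self.blocks = nn.ModuleList(SuperBlock(max_dim=max(dims)) for _ in range(depth))
        # Private class tokens: one per candidate embedding width, so that
        # sub-architectures with different token embeddings do not share one token.
        self.cls_tokens = nn.ParameterDict(
            {str(d): nn.Parameter(torch.zeros(1, 1, d)) for d in dims})
        self.head = SuperLinear(max(dims), num_classes)
        # Stand-in for the searchable per-layer probability of the identity op,
        # which adapts the expected depth (and hence FLOPs) of sampled subnets.
        self.identity_logits = nn.Parameter(torch.zeros(depth))

    def sample_arch(self):
        dim = random.choice(self.dims)
        p_identity = torch.sigmoid(self.identity_logits)
        identity = [random.random() < p.item() for p in p_identity]
        return {"dim": dim, "hidden": 4 * dim, "identity": identity}

    def forward(self, patch_tokens, arch):
        # patch_tokens: (B, N, arch["dim"]) embeddings for the sampled width.
        d = arch["dim"]
        cls = self.cls_tokens[str(d)].expand(patch_tokens.size(0), -1, -1)
        x = torch.cat([cls, patch_tokens], dim=1)
        for block, skip in zip(self.blocks, arch["identity"]):
            x = block(x, d, arch["hidden"], use_identity=skip)
        return self.head(x[:, 0], in_dim=d)


supernet = SuperViT()
arch = supernet.sample_arch()                 # one candidate sub-architecture
tokens = torch.randn(2, 196, arch["dim"])     # dummy patch embeddings
logits = supernet(tokens, arch)               # (2, 1000)
```

In the paper the identity-operation probability is searched to match a hardware budget rather than trained by gradient descent as in this stand-in, and the number of heads and sequence size are searched as well; the sketch only illustrates the slicing mechanism that lets all candidates share one set of super-transformer weights.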
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - TurboViT: Generating Fast Vision Transformers via Generative Architecture Search [74.24393546346974]
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years.
There has recently been significant research on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
arXiv Detail & Related papers (2023-08-22T13:08:29Z) - Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z) - MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z) - A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z) - Searching the Search Space of Vision Transformer [98.96601221383209]
Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection.
We propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space.
We provide design guidelines of general vision transformers with extensive analysis according to the space searching process.
arXiv Detail & Related papers (2021-11-29T17:26:07Z) - A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.