TurboViT: Generating Fast Vision Transformers via Generative
Architecture Search
- URL: http://arxiv.org/abs/2308.11421v1
- Date: Tue, 22 Aug 2023 13:08:29 GMT
- Title: TurboViT: Generating Fast Vision Transformers via Generative
Architecture Search
- Authors: Alexander Wong, Saad Abbasi, Saeejith Nair
- Abstract summary: Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years.
There has been significant research recently on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
- Score: 74.24393546346974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have shown unprecedented levels of performance in
tackling various visual perception tasks in recent years. However, the
architectural and computational complexity of such network architectures has
made them challenging to deploy in real-world applications with
high-throughput, low-memory requirements. As such, there has been significant
research recently on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer
architecture designs via generative architecture search (GAS) to achieve a
strong balance between accuracy and architectural and computational efficiency.
Through this generative architecture search process, we create TurboViT, a
highly efficient hierarchical vision transformer architecture design that is
generated around mask unit attention and Q-pooling design patterns. The
resulting TurboViT architecture design achieves significantly lower
architectural complexity (>2.47$\times$ smaller than FasterViT-0
while achieving the same accuracy) and computational complexity (>3.4$\times$ fewer
FLOPs and 0.9% higher accuracy than MobileViT2-2.0) when compared to 10 other
state-of-the-art efficient vision transformer network architecture designs
within a similar range of accuracy on the ImageNet-1K dataset. Furthermore,
TurboViT demonstrated strong inference latency and throughput in both
low-latency and batch processing scenarios (>3.21$\times$ lower latency and
>3.18$\times$ higher throughput compared to FasterViT-0 for the low-latency
scenario). These promising results demonstrate the efficacy of leveraging
generative architecture search for generating efficient transformer
architecture designs for high-throughput scenarios.
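The abstract states that TurboViT is generated around mask unit attention and Q-pooling design patterns (design elements popularized by Hiera-style hierarchical vision transformers). Below is a minimal PyTorch sketch of how these two patterns are commonly combined; the class name, argument names, and the max-pooling choice for Q-pooling are illustrative assumptions, not the actual TurboViT implementation.

```python
# Illustrative sketch only: mask unit attention restricts self-attention to
# local windows ("mask units"), and Q-pooling downsamples the queries so the
# block emits fewer tokens than it receives. All names are hypothetical.
import torch
import torch.nn as nn


class MaskUnitAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4,
                 mask_unit_size: int = 64, q_stride: int = 1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.mask_unit_size = mask_unit_size  # tokens per local attention window
        self.q_stride = q_stride              # >1 enables Q-pooling
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); tokens must be divisible by mask_unit_size
        B, N, D = x.shape
        units = N // self.mask_unit_size
        qkv = self.qkv(x).reshape(B, units, self.mask_unit_size,
                                  3, self.num_heads, self.head_dim)
        # -> (3, batch, heads, units, tokens_per_unit, head_dim)
        qkv = qkv.permute(3, 0, 4, 1, 2, 5)
        q, k, v = qkv[0], qkv[1], qkv[2]
        if self.q_stride > 1:
            # Q-pooling: max-pool queries inside each mask unit, shrinking the
            # number of output tokens by a factor of q_stride
            q = q.reshape(B, self.num_heads, units,
                          self.mask_unit_size // self.q_stride,
                          self.q_stride, self.head_dim).amax(dim=4)
        # Attention is computed independently within each mask unit
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v
        # (batch, heads, units, out_tokens, head_dim) -> (batch, tokens_out, dim)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, -1, D)
        return self.proj(out)


# Example: 256 tokens grouped into 4 mask units of 64; q_stride=2 halves the
# token count, so the output shape is (2, 128, 96).
x = torch.randn(2, 256, 96)
block = MaskUnitAttention(dim=96, num_heads=4, mask_unit_size=64, q_stride=2)
print(block(x).shape)
```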
Related papers
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702] (2023-02-27)
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446] (2022-12-15)
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
- Efficient Neural Net Approaches in Metal Casting Defect Detection [0.0] (2022-08-08)
This research proposes a lightweight architecture that is efficient in terms of both accuracy and inference time.
Our results indicate that a custom model of 590K parameters with depth-wise separable convolutions outperformed pretrained architectures.
- Neural Architecture Search on Efficient Transformers and Beyond [23.118556295894376] (2022-07-28)
We propose a new framework to find optimal architectures for efficient Transformers with the neural architecture search (NAS) technique.
We observe that the optimal architecture of the efficient Transformer requires less computation than that of the standard Transformer.
Our searched architecture maintains accuracy comparable to the standard Transformer with notably improved computational efficiency.
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391] (2021-12-17)
We propose a simple and compact ViT architecture called the Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
- Vision Transformer Architecture Search [64.73920718915282] (2021-06-25)
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture under similar hardware budgets.
Our searched architecture achieves 74.7% top-1 accuracy on ImageNet and is 2.5% higher than the current baseline ViT architecture.
- Twins: Revisiting Spatial Attention Design in Vision Transformers [81.02454258677714] (2021-04-28)
In this work, we demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against state-of-the-art schemes.
We propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks.