Twins: Revisiting Spatial Attention Design in Vision Transformers
- URL: http://arxiv.org/abs/2104.13840v1
- Date: Wed, 28 Apr 2021 15:42:31 GMT
- Title: Twins: Revisiting Spatial Attention Design in Vision Transformers
- Authors: Xiangxiang Chu and Zhi Tian and Yuqing Wang and Bo Zhang and Haibing
Ren and Xiaolin Wei and Huaxia Xia and Chunhua Shen
- Abstract summary: In this work, we demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes.
We propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT.
Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks.
- Score: 81.02454258677714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Very recently, a variety of vision transformer architectures for dense
prediction tasks have been proposed and they show that the design of spatial
attention is critical to their success in these tasks. In this work, we revisit
the design of the spatial attention and demonstrate that a carefully-devised
yet simple spatial attention mechanism performs favourably against the
state-of-the-art schemes. As a result, we propose two vision transformer
architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures
are highly-efficient and easy to implement, only involving matrix
multiplications that are highly optimized in modern deep learning frameworks.
More importantly, the proposed architectures achieve excellent performance on a
wide range of visual tasks including image-level classification as well as dense
detection and segmentation. The simplicity and strong performance suggest that
our proposed architectures may serve as stronger backbones for many vision
tasks. Our code will be released soon at
https://github.com/Meituan-AutoML/Twins.
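To make the claim about "only matrix multiplications" concrete, below is a minimal PyTorch sketch, not the released Twins code, of a spatial attention block in the spirit of Twins-SVT: attention is first computed inside non-overlapping local windows and then globally against a sub-sampled set of keys and values. All class names, shapes, and hyperparameters are illustrative assumptions.

# Illustrative sketch only (not the released Twins code): spatial attention built
# purely from dense matrix multiplications, loosely following the locally-grouped
# plus globally sub-sampled pattern of Twins-SVT.
import torch
import torch.nn as nn

class LocallyGroupedAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of the feature map."""
    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by window
        B, H, W, C = x.shape
        k = self.window
        # Partition the map into (H/k * W/k) windows of k*k tokens each.
        x = x.reshape(B, H // k, k, W // k, k, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // k) * (W // k), k * k, C)
        x, _ = self.attn(x, x, x)  # attention only within each window
        x = x.reshape(B, H // k, W // k, k, k, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)

class GlobalSubsampledAttention(nn.Module):
    """Every token attends to a strided sub-sampling of the whole feature map."""
    def __init__(self, dim, num_heads=4, sr_ratio=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = x.reshape(B, H * W, C)
        # Reduce keys/values to a coarse grid so global attention stays cheap.
        kv = self.sr(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, H, W, C)

if __name__ == "__main__":
    x = torch.randn(2, 28, 28, 64)
    y = GlobalSubsampledAttention(64)(LocallyGroupedAttention(64)(x))
    print(y.shape)  # torch.Size([2, 28, 28, 64])

In Twins-SVT the two forms of attention are interleaved across blocks; every operation above reduces to a dense matrix multiplication or a strided convolution, which is why the design maps well onto existing deep learning frameworks.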
Related papers
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z)
- Activator: GLU Activation Function as the Core Component of a Vision Transformer [1.3812010983144802]
The transformer architecture is currently the main driver behind many successes across a wide variety of deep learning tasks.
This paper investigates replacing the attention mechanism usually adopted in transformer architectures with a gated linear unit (GLU) activation incorporated into a multi-layer perceptron architecture.
arXiv Detail & Related papers (2024-05-24T21:46:52Z)
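As a rough illustration of the Activator idea (a sketch under my own assumptions, not the authors' implementation), a gated linear unit inside an MLP block splits the input projection into a content path and a gate path and multiplies them elementwise:

# Minimal sketch of a GLU-style MLP block that could stand in for attention
# (illustrative assumption, not the Activator authors' code).
import torch
import torch.nn as nn

class GLUMlp(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.value = nn.Linear(dim, hidden_dim)  # content path
        self.gate = nn.Linear(dim, hidden_dim)   # gating path
        self.proj = nn.Linear(hidden_dim, dim)

    def forward(self, x):  # x: (B, N, C) token embeddings
        # GLU: elementwise product of a linear projection and a sigmoid gate.
        return self.proj(self.value(x) * torch.sigmoid(self.gate(x)))

tokens = torch.randn(2, 196, 64)
print(GLUMlp(64, 256)(tokens).shape)  # torch.Size([2, 196, 64])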
- TurboViT: Generating Fast Vision Transformers via Generative Architecture Search [74.24393546346974]
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years.
There has been significant recent research on the design of efficient vision transformer architectures.
In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
arXiv Detail & Related papers (2023-08-22T13:08:29Z)
- Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
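To make the token-sparsification idea above concrete, here is a toy score-based pruning step, my own simplification rather than the authors' framework: a lightweight head scores every token and only the top-scoring fraction is passed to the following blocks.

# Toy sketch of score-based dynamic token pruning (an illustrative simplification,
# not the authors' framework): keep only the highest-scoring tokens per image.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # predicts a keep-score per token
        self.keep_ratio = keep_ratio

    def forward(self, x):  # x: (B, N, C)
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(x).squeeze(-1)   # (B, N)
        idx = scores.topk(k, dim=1).indices  # indices of tokens to keep
        # Gather the kept tokens; later blocks then run on fewer tokens.
        return torch.gather(x, 1, idx.unsqueeze(-1).expand(B, k, C))

tokens = torch.randn(2, 196, 64)
print(TokenPruner(64)(tokens).shape)  # torch.Size([2, 98, 64])

In the paper the keep/drop decision is trained end to end with a differentiable relaxation; the hard top-k above is only for illustration.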
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - A Simple Single-Scale Vision Transformer for Object Localization and
Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called the Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves 74.7% top-1 accuracy on ImageNet, 2.5% higher than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z)