RegionViT: Regional-to-Local Attention for Vision Transformers
- URL: http://arxiv.org/abs/2106.02689v1
- Date: Fri, 4 Jun 2021 19:57:11 GMT
- Title: RegionViT: Regional-to-Local Attention for Vision Transformers
- Authors: Chun-Fu Chen, Rameswar Panda, Quanfu Fan
- Abstract summary: Vision transformer (ViT) has recently shown strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification.
We propose a new architecture that adopts a pyramid structure and employs a novel regional-to-local attention.
Our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works.
- Score: 17.70988054450176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformer (ViT) has recently shown strong capability in
achieving results comparable to convolutional neural networks (CNNs) on image
classification. However, vanilla ViT simply inherits its architecture directly
from natural language processing, which is often not optimized for vision
applications. Motivated by this, in this paper we propose a new architecture
that adopts a pyramid structure and employs a novel regional-to-local attention
rather than global self-attention in vision transformers. More specifically,
our model first generates regional tokens and local tokens from an image with
different patch sizes, where each regional token is associated with a set of
local tokens based on spatial location. The regional-to-local attention
consists of two steps: first, regional self-attention extracts global
information among all regional tokens, and then local self-attention exchanges
information between each regional token and its associated local tokens.
Therefore, even though local self-attention confines its scope to a local
region, it can still receive global information. Extensive experiments on three
vision tasks, including image classification, object detection and action
recognition, show that our approach outperforms or is on par with
state-of-the-art ViT variants, including many concurrent works. Our source code
and models will be publicly available.
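To make the two-step attention concrete, below is a minimal, illustrative sketch in PyTorch. The class name, window size, head count, residual placement, and the use of nn.MultiheadAttention are assumptions made for clarity; this is not the authors' released implementation.

```python
# Minimal sketch of regional-to-local attention (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn


class RegionalToLocalAttention(nn.Module):
    """One R2L step: regional self-attention over all regional tokens,
    then local self-attention over [regional token || its local tokens] per window."""

    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional, local):
        # regional: (B, R, C) -- one token per window
        # local:    (B, R, W*W, C) -- W*W local tokens per window
        B, R, C = regional.shape

        # Step 1: regional self-attention exchanges global information
        # among all regional tokens.
        regional = regional + self.regional_attn(regional, regional, regional)[0]

        # Step 2: local self-attention within each window, with the window's
        # regional token prepended so local tokens can read global context through it.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)   # (B, R, 1+W*W, C)
        tokens = tokens.reshape(B * R, 1 + self.window ** 2, C)
        tokens = tokens + self.local_attn(tokens, tokens, tokens)[0]
        tokens = tokens.reshape(B, R, 1 + self.window ** 2, C)

        # Split back into updated regional and local tokens.
        return tokens[:, :, 0], tokens[:, :, 1:]


# Toy usage: 8 windows of 7x7 local tokens, embedding dim 96.
blk = RegionalToLocalAttention(dim=96)
reg = torch.randn(2, 8, 96)
loc = torch.randn(2, 8, 49, 96)
reg, loc = blk(reg, loc)
```

In this sketch, step 1 lets the few regional tokens exchange information globally at low cost, and step 2 lets each window's local tokens attend only within their window while still receiving global context through the prepended regional token.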
Related papers
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them with fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG).
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain the global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation [67.26984058377435]
We present L2G, a simple online local-to-global knowledge transfer framework for high-quality object attention mining.
Our framework directs the global network to learn the rich object-detail knowledge captured from a global view.
Experiments show that our method attains 72.1% and 44.2% mIoU on the validation sets of PASCAL VOC 2012 and MS COCO 2014, respectively.
arXiv Detail & Related papers (2022-04-07T04:31:32Z)
- Adaptively Enhancing Facial Expression Crucial Regions via Local Non-Local Joint Network [37.665344656227624]
A local non-local joint network is proposed to adaptively highlight the crucial facial regions during feature learning for facial expression recognition.
The proposed method achieves more competitive performance compared with several state-of-the-art methods on five benchmark datasets.
arXiv Detail & Related papers (2022-03-26T10:58:25Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor that has been explored extensively by approaches based on convolutional neural networks (CNNs).
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (see the sketch after this list).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
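As a companion to the LocalViT entry above, here is a minimal sketch of a locality-augmented feed-forward network: a 3x3 depth-wise convolution placed between the two linear layers of a standard transformer FFN, applied after reshaping the token sequence into a 2D feature map. The class name, expansion ratio, activation choice, and the handling of tokens (class token excluded) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a locality-augmented transformer FFN (illustrative assumptions).
import torch
import torch.nn as nn


class LocalFeedForward(nn.Module):
    """Transformer FFN with a 3x3 depth-wise convolution between the two
    linear layers, so each token also mixes with its spatial neighbours."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N == h * w (class token excluded).
        B, N, C = x.shape
        x = self.act(self.fc1(x))                    # expand channels
        x = x.transpose(1, 2).reshape(B, -1, h, w)   # sequence -> 2D feature map
        x = self.act(self.dwconv(x))                 # depth-wise 3x3 adds locality
        x = x.flatten(2).transpose(1, 2)             # 2D map -> sequence
        return self.fc2(x)                           # project back to dim


# Toy usage: 14x14 = 196 tokens, embedding dim 192.
ffn = LocalFeedForward(dim=192)
out = ffn(torch.randn(2, 196, 192), h=14, w=14)
```

Because the convolution is depth-wise, the extra parameters and compute are small relative to the linear layers, which is what makes this a lightweight way to inject spatial locality into a transformer block.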