Fusion of regional and sparse attention in Vision Transformers
- URL: http://arxiv.org/abs/2406.08859v1
- Date: Thu, 13 Jun 2024 06:48:25 GMT
- Title: Fusion of regional and sparse attention in Vision Transformers
- Authors: Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara
- Abstract summary: Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions.
We propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information.
Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42%.
- Score: 4.782322901897837
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions, in contrast to the global attention employed in the original ViT. Regional attention restricts pixel interactions within specific regions, while sparse attention disperses them across sparse grids. These differing approaches pose a trade-off between maintaining hierarchical relationships and capturing global context. In this study, drawing inspiration from atrous convolution, we propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information while preserving hierarchical structures. Based on this, we introduce a versatile, hybrid vision transformer backbone called ACC-ViT, tailored for standard vision tasks. Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42% while requiring 8.4% fewer parameters.
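To make the idea concrete, below is a minimal sketch of attention computed over atrous-sampled token groups: tokens are gathered on a sparse grid at several dilation rates, self-attention is computed within each group, and the per-rate outputs are fused. The partitioning scheme, the uniform averaging across rates, and the use of `nn.MultiheadAttention` are illustrative assumptions; this is not the authors' ACC-ViT implementation.

```python
# Sketch of dilated ("atrous") window attention, loosely following the idea in the
# abstract: sample tokens on a sparse grid (dilation rate r), attend within each
# sampled group, and fuse the outputs from several rates. Illustrative assumptions
# only; not the ACC-ViT code.
import torch
import torch.nn as nn


def atrous_partition(x: torch.Tensor, rate: int) -> torch.Tensor:
    """Rearrange (B, H, W, C) into (B * rate^2, H/rate * W/rate, C) groups,
    where each group holds tokens sampled every `rate` pixels (atrous sampling)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // rate, rate, W // rate, rate, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B * rate * rate, (H // rate) * (W // rate), C)


def atrous_reverse(x: torch.Tensor, rate: int, H: int, W: int) -> torch.Tensor:
    """Inverse of atrous_partition, back to (B, H, W, C)."""
    B, C = x.shape[0] // (rate * rate), x.shape[-1]
    x = x.reshape(B, rate, rate, H // rate, W // rate, C)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, H, W, C)


class AtrousAttention(nn.Module):
    """Self-attention within atrous-sampled token groups, averaged over dilation rates."""

    def __init__(self, dim: int, num_heads: int = 4, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        B, H, W, C = x.shape
        outputs = []
        for r in self.rates:
            groups = atrous_partition(x, r)             # groups sampled from a sparse grid
            out, _ = self.attn(groups, groups, groups)  # attention inside each group
            outputs.append(atrous_reverse(out, r, H, W))
        return torch.stack(outputs, dim=0).mean(dim=0)  # naive fusion across rates


if __name__ == "__main__":
    feats = torch.randn(2, 16, 16, 64)   # (B, H, W, C); H and W divisible by all rates
    layer = AtrousAttention(dim=64)
    print(layer(feats).shape)            # torch.Size([2, 16, 16, 64])
```

In this sketch, rate 1 reduces to ordinary global attention and larger rates give the sparse, grid-style interactions the abstract refers to; contiguous windowing for the regional branch and a learned fusion across rates are omitted for brevity.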
Related papers
- ACC-ViT : Atrous Convolution's Comeback in Vision Transformers [5.224344210588584]
We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone that remains competitive in mobile-scale versions, making it well suited for niche applications with small datasets.
arXiv Detail & Related papers (2024-03-07T04:05:16Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - Global Context Vision Transformers [78.5346173956383]
We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - DaViT: Dual Attention Vision Transformers [94.62855697081079]
We introduce Dual Attention Vision Transformers (DaViT).
DaViT is a vision transformer architecture that is able to capture global context while maintaining computational efficiency.
We show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations.
arXiv Detail & Related papers (2022-04-07T17:59:32Z) - BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z) - RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN)-based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z) - Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - RegionViT: Regional-to-Local Attention for Vision Transformers [17.70988054450176]
Vision transformer (ViT) has recently shown its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification.
We propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention.
Our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works.
arXiv Detail & Related papers (2021-06-04T19:57:11Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
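The LocalViT entry above describes adding locality by placing a depth-wise convolution inside the transformer feed-forward network. Below is a minimal, hedged sketch of what such a block can look like: tokens are reshaped to a 2D map, passed through a depth-wise convolution between the two pointwise projections, and flattened back. The kernel size, activation, and layer ordering are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of a feed-forward block with a depth-wise convolution between the two
# linear projections, in the spirit of LocalViT's idea; details are assumptions.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim: int, expansion: int = 4, kernel_size: int = 3):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)                       # pointwise expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2,
                                groups=hidden)                  # depth-wise conv adds locality
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)                       # pointwise projection

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        x = self.act(self.fc1(x))
        x = x.transpose(1, 2).reshape(B, -1, H, W)              # tokens -> 2D feature map
        x = self.act(self.dwconv(x))
        x = x.reshape(B, -1, N).transpose(1, 2)                 # back to token sequence
        return self.fc2(x)


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 96)
    ffn = ConvFFN(dim=96)
    print(ffn(tokens, H=14, W=14).shape)                        # torch.Size([2, 196, 96])
```

Because the convolution is depth-wise, each channel is filtered independently, so the added cost is small while neighbouring tokens now interact inside the FFN, which is the locality the summary refers to.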
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.