BOAT: Bilateral Local Attention Vision Transformer
- URL: http://arxiv.org/abs/2201.13027v1
- Date: Mon, 31 Jan 2022 07:09:50 GMT
- Title: BOAT: Bilateral Local Attention Vision Transformer
- Authors: Tan Yu, Gangming Zhao, Ping Li, Yizhou Yu
- Abstract summary: Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
- Score: 70.32810772368151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers achieved outstanding performance in many computer vision
tasks. Early Vision Transformers such as ViT and DeiT adopt global
self-attention, which is computationally expensive when the number of patches
is large. To improve efficiency, recent Vision Transformers adopt local
self-attention mechanisms, where self-attention is computed within local
windows. Despite the fact that window-based local self-attention significantly
boosts efficiency, it fails to capture the relationships between distant but
similar patches in the image plane. To overcome this limitation of image-space
local attention, in this paper, we further exploit the locality of patches in
the feature space. We group the patches into multiple clusters using their
features, and self-attention is computed within every cluster. Such
feature-space local attention effectively captures connections between patches
that lie in different local windows but are still relevant to one another. We propose a
Bilateral lOcal Attention vision Transformer (BOAT), which integrates
feature-space local attention with image-space local attention. We further
integrate BOAT with both Swin and CSWin models, and extensive experiments on
several benchmark datasets demonstrate that our BOAT-CSWin model clearly and
consistently outperforms existing state-of-the-art CNN models and vision
Transformers.
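To make the bilateral idea concrete, here is a minimal, self-contained sketch of the two attention modes the abstract describes: image-space attention within non-overlapping windows, and feature-space attention within groups of patches clustered by feature similarity. This is not the authors' implementation; the balanced grouping via a 1-D feature projection, the single-head attention without learned Q/K/V projections, and all shapes and hyper-parameters (window_size, num_clusters) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def window_attention(x, window_size):
    """Image-space local attention: plain self-attention inside each
    non-overlapping window. x: (B, H, W, C); H, W divisible by window_size."""
    B, H, W, C = x.shape
    ws = window_size
    win = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    win = win.reshape(-1, ws * ws, C)                        # (B*nWin, ws*ws, C)
    attn = F.softmax(win @ win.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = (attn @ win).view(B, H // ws, W // ws, ws, ws, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def feature_space_attention(x, num_clusters):
    """Feature-space local attention: group patches by feature similarity and
    attend within each group. The balanced grouping below (sort by a 1-D random
    feature projection, split into equal-size groups) is only a stand-in for
    the paper's clustering; H*W must be divisible by num_clusters."""
    B, H, W, C = x.shape
    tokens = x.reshape(B, H * W, C)
    proj = torch.randn(C, device=x.device)
    order = (tokens @ proj).argsort(dim=1)                   # (B, N)
    idx = order.unsqueeze(-1).expand(-1, -1, C)
    grouped = torch.gather(tokens, 1, idx)                   # tokens in sorted order
    groups = grouped.view(B * num_clusters, (H * W) // num_clusters, C)
    attn = F.softmax(groups @ groups.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = (attn @ groups).reshape(B, H * W, C)
    inv = order.argsort(dim=1)                               # inverse permutation
    out = torch.gather(out, 1, inv.unsqueeze(-1).expand(-1, -1, C))
    return out.reshape(B, H, W, C)

def bilateral_local_attention(x, window_size=7, num_clusters=4):
    """Image-space window attention followed by feature-space cluster attention."""
    return feature_space_attention(window_attention(x, window_size), num_clusters)

x = torch.randn(2, 14, 14, 96)               # a 14x14 grid of 96-dim patch features
print(bilateral_local_attention(x).shape)    # torch.Size([2, 14, 14, 96])
```

In the full model, each attention step would use multi-head attention with learned projections and sit inside a Swin- or CSWin-style backbone; the sketch only shows how the two notions of locality compose.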
Related papers
- Fusion of regional and sparse attention in Vision Transformers [4.782322901897837]
Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions.
We propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information.
Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42%.
arXiv Detail & Related papers (2024-06-13T06:48:25Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model both local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows [4.406336825345075]
Recently, Transformers have shown good performance in several vision tasks due to their powerful modeling capabilities.
We propose AxWin Attention, which models context information in both local windows and axial views.
Based on the AxWin Attention, we develop a context-aware vision transformer backbone, named AxWin Transformer.
arXiv Detail & Related papers (2023-05-02T09:33:11Z)
- Axially Expanded Windows for Local-Global Interaction in Vision Transformers [1.583842747998493]
Global self-attention is very expensive to compute, especially for high-resolution vision tasks.
We develop an axially expanded window self-attention mechanism that performs fine-grained self-attention within the local window and coarse-grained self-attention in the horizontal and vertical axes.
arXiv Detail & Related papers (2022-09-19T02:53:07Z)
- Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing a depth-wise convolution into the feed-forward network (a minimal sketch of this design follows the list below).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
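The LocalViT entry above describes a concrete mechanism: a depth-wise convolution placed between the two linear layers of the Transformer feed-forward network, so that each token also mixes information with its spatial neighbours. Below is a minimal sketch of such a locality-enhanced feed-forward block; the module name, expansion ratio, and kernel size are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Feed-forward network with a depth-wise convolution between its two
    linear layers, adding image-space locality to a Transformer block."""

    def __init__(self, dim, hidden_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * hidden_ratio
        self.expand = nn.Linear(dim, hidden)             # token-wise expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)
        self.act = nn.GELU()
        self.reduce = nn.Linear(hidden, dim)             # token-wise reduction

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        h = self.act(self.expand(x))                     # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)       # to (B, hidden, H, W)
        h = self.act(self.dwconv(h))                     # depth-wise spatial mixing
        h = h.reshape(B, -1, N).transpose(1, 2)          # back to (B, N, hidden)
        return self.reduce(h)

tokens = torch.randn(2, 14 * 14, 96)                     # 14x14 patch grid
ffn = LocalityFeedForward(dim=96)
print(ffn(tokens, H=14, W=14).shape)                     # torch.Size([2, 196, 96])
```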