Axially Expanded Windows for Local-Global Interaction in Vision Transformers
- URL: http://arxiv.org/abs/2209.08726v1
- Date: Mon, 19 Sep 2022 02:53:07 GMT
- Title: Axially Expanded Windows for Local-Global Interaction in Vision Transformers
- Authors: Zhemin Zhang, Xun Gong
- Abstract summary: Global self-attention is very expensive to compute, especially for high-resolution vision tasks.
We develop an axially expanded window self-attention mechanism that performs fine-grained self-attention within the local window and coarse-grained self-attention along the horizontal and vertical axes.
- Score: 1.583842747998493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformers have shown promising performance in various vision
tasks. A challenging issue in Transformer design is that global self-attention
is very expensive to compute, especially for high-resolution vision tasks.
Local self-attention performs the attention computation within a local region to
improve efficiency, but the receptive field of a single attention layer is then
not large enough, resulting in insufficient context modeling. When observing a
scene, humans usually focus on a local region while attending to non-attentional
regions at coarse granularity. Based on this observation, we develop an axially
expanded window self-attention mechanism that performs fine-grained
self-attention within the local window and coarse-grained self-attention along
the horizontal and vertical axes, and thus can effectively capture both short-
and long-range visual dependencies.
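The abstract gives no implementation details, but the described combination of fine-grained window attention and coarse-grained axial attention can be illustrated with a small PyTorch sketch. In the sketch below, each query attends to the tokens of its own local window plus pooled summary tokens from the windows that share its row and column; the window size, the use of average pooling to build the coarse tokens, and all class and variable names are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation): fine-grained attention inside
# each local window plus coarse-grained attention to pooled tokens from the
# windows on the same horizontal and vertical axes.
import torch
import torch.nn as nn


class AxialWindowAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.heads, self.win = num_heads, window
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        w, h = self.win, self.heads
        nH, nW = H // w, W // w
        q, k, v = self.qkv(x).chunk(3, dim=-1)                      # each (B, H, W, C)

        def to_windows(t):  # (B, H, W, C) -> (B, nWin, w*w, heads, head_dim)
            t = t.reshape(B, nH, w, nW, w, h, C // h)
            return t.permute(0, 1, 3, 2, 4, 5, 6).reshape(B, nH * nW, w * w, h, C // h)

        qw, kw, vw = to_windows(q), to_windows(k), to_windows(v)    # fine-grained tokens

        # Coarse tokens: one summary vector per window via average pooling (assumption).
        kc = k.reshape(B, nH, w, nW, w, C).mean(dim=(2, 4))         # (B, nH, nW, C)
        vc = v.reshape(B, nH, w, nW, w, C).mean(dim=(2, 4))

        def axial_context(t):  # coarse tokens on the same row and column as each window
            row = t.unsqueeze(2).expand(B, nH, nW, nW, C)           # all windows in the row
            col = t.transpose(1, 2).unsqueeze(1).expand(B, nH, nW, nH, C)  # and in the column
            ctx = torch.cat([row, col], dim=3)                      # (B, nH, nW, nW + nH, C)
            return ctx.reshape(B, nH * nW, nW + nH, h, C // h)

        k_all = torch.cat([kw, axial_context(kc)], dim=2)           # fine + coarse keys
        v_all = torch.cat([vw, axial_context(vc)], dim=2)

        # Standard scaled dot-product attention per window and head.
        attn = torch.einsum("bnqhd,bnkhd->bnhqk", qw, k_all) * self.scale
        out = torch.einsum("bnhqk,bnkhd->bnqhd", attn.softmax(dim=-1), v_all)

        # Merge the windows back into a (B, H, W, C) feature map.
        out = out.reshape(B, nH, nW, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)


x = torch.randn(2, 28, 28, 96)                                       # (B, H, W, C)
print(AxialWindowAttention(dim=96, num_heads=4, window=7)(x).shape)  # torch.Size([2, 28, 28, 96])
```

Under this sketch, each query compares against w*w fine-grained tokens plus (H + W)/w coarse tokens rather than all H*W tokens, which is where the efficiency gain over global self-attention would come from.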
Related papers
- LocalEyenet: Deep Attention framework for Localization of Eyes [0.609170287691728]
We propose a deep coarse-to-fine architecture called LocalEyenet for localization of only the eye regions, which can be trained end-to-end.
Our model shows good generalization ability in cross-dataset evaluation and in real-time localization of eyes.
arXiv Detail & Related papers (2023-03-13T06:35:45Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG).
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain the global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- Boosting Crowd Counting via Multifaceted Attention [109.89185492364386]
Large-scale variations often exist within crowd images.
Neither fixed-size convolution kernel of CNN nor fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- TransVPR: Transformer-based place recognition with multi-level attention aggregation [9.087163485833058]
We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
arXiv Detail & Related papers (2022-01-06T10:20:24Z)
- Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any model with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (see the sketch after this list).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
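The LocalViT entry above describes adding locality by inserting a depth-wise convolution into the transformer feed-forward network. A minimal PyTorch sketch of that idea, assuming a 3x3 kernel, GELU activations, and tokens reshaped to a 2-D grid (illustrative choices, not the paper's exact configuration):

```python
# Minimal sketch (not the paper's implementation): a feed-forward network with a
# depth-wise convolution between its two linear layers, applied on the tokens
# reshaped back to their 2-D spatial layout.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim, hidden_dim, kernel_size=3):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dw = nn.Conv2d(hidden_dim, hidden_dim, kernel_size,
                            padding=kernel_size // 2, groups=hidden_dim)  # depth-wise
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()

    def forward(self, x, hw):
        # x: (B, N, C) token sequence; hw = (H, W) with H * W == N
        B, N, C = x.shape
        H, W = hw
        x = self.act(self.fc1(x))                    # (B, N, hidden)
        x = x.transpose(1, 2).reshape(B, -1, H, W)   # tokens -> 2-D grid
        x = self.act(self.dw(x))                     # local mixing via depth-wise conv
        x = x.reshape(B, -1, N).transpose(1, 2)      # grid -> tokens
        return self.fc2(x)


tokens = torch.randn(2, 14 * 14, 192)               # (B, N, C)
ffn = ConvFFN(dim=192, hidden_dim=4 * 192)
print(ffn(tokens, hw=(14, 14)).shape)                # torch.Size([2, 196, 192])
```

The depth-wise convolution mixes information only within a small spatial neighbourhood of each channel, which is how the locality of an inverted residual block is brought into the token-based feed-forward network.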