Locally Shifted Attention With Early Global Integration
- URL: http://arxiv.org/abs/2112.05080v1
- Date: Thu, 9 Dec 2021 18:12:24 GMT
- Title: Locally Shifted Attention With Early Global Integration
- Authors: Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf
- Abstract summary: We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
- Score: 93.5766619842226
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent work has shown the potential of transformers for computer vision
applications. An image is first partitioned into patches, which are then used
as input tokens for the attention mechanism. Due to the expensive quadratic
cost of the attention mechanism, either a large patch size is used, resulting
in coarse-grained global interactions, or alternatively, attention is applied
only on a local region of the image, at the expense of long-range interactions.
In this work, we propose an approach that allows for both coarse global
interactions and fine-grained local interactions already at early layers of a
vision transformer.
At the core of our method is the application of local and global attention
layers. In the local attention layer, we apply attention to each patch and its
local shifts, resulting in virtually located local patches, which are not bound
to a single, specific location. These virtually located patches are then used
in a global attention layer. The separation of the attention layer into local
and global counterparts allows for a low computational cost in the number of
patches, while still supporting data-dependent localization already at the
first layer, as opposed to the static positioning in other visual transformers.
Our method is shown to be superior to both convolutional and transformer-based
methods for image classification on CIFAR10, CIFAR100, and ImageNet. Code is
available at: https://github.com/shellysheynin/Locally-SAG-Transformer.
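Below is a minimal sketch of the two-stage attention described in the abstract: local attention over each patch and its spatial shifts, producing one "virtually located" token per patch, followed by global attention over those tokens. It is not the authors' implementation (see the repository linked above); the module names, the query convention, and the 3x3 shift neighbourhood are assumptions for illustration.

```python
# Hedged sketch: local attention over shifted patch copies, then global
# attention over the resulting "virtually located" patch tokens.
# Names (LocalShiftAttention, GlobalAttention, num_shifts) are illustrative.
import torch
import torch.nn as nn


class LocalShiftAttention(nn.Module):
    """Attend from each patch token to its locally shifted copies.

    Input:  shifted patch embeddings of shape (B, N, S, D), where N is the
            number of patches and S the number of local shifts (the un-shifted
            patch included).
    Output: one "virtually located" token per patch, shape (B, N, D).
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shifted_tokens: torch.Tensor) -> torch.Tensor:
        B, N, S, D = shifted_tokens.shape
        x = shifted_tokens.reshape(B * N, S, D)
        # The un-shifted patch (index 0 by convention here) queries its shifts;
        # the attention weights decide where the patch is "virtually located",
        # so the localization is data-dependent rather than static.
        query = x[:, :1, :]                      # (B*N, 1, D)
        out, _ = self.attn(query, x, x)          # (B*N, 1, D)
        return out.reshape(B, N, D)


class GlobalAttention(nn.Module):
    """Standard self-attention over the N virtually located patch tokens.

    Cost is quadratic only in N, not in N * S, which is the point of
    separating the layer into local and global counterparts.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(tokens, tokens, tokens)
        return out


if __name__ == "__main__":
    B, N, S, D = 2, 64, 9, 128   # 9 shifts = assumed 3x3 neighbourhood
    shifted = torch.randn(B, N, S, D)
    virtual_patches = LocalShiftAttention(D)(shifted)   # (2, 64, 128)
    out = GlobalAttention(D)(virtual_patches)           # (2, 64, 128)
    print(out.shape)
```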
Related papers
- LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching [8.503217766507584]
A novel convolutional transformer is proposed to capture both local contexts and global structures.
A universal FPN-like framework uses transformers to capture global structures in both the self-encoder and the cross-decoder.
A novel regression-based sub-pixel refinement module exploits the whole fine-grained window features for fine-level positional deviation regression.
arXiv Detail & Related papers (2023-11-29T12:06:19Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- TransVPR: Transformer-based place recognition with multi-level attention aggregation [9.087163485833058]
We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
arXiv Detail & Related papers (2022-01-06T10:20:24Z)
- Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Because existing methods pay little attention to content changes, semantic information from the source images degrades during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- CAT: Cross Attention in Vision Transformer [39.862909079452294]
We propose a new attention mechanism in Transformer called Cross Attention.
It alternates attention within the image patch instead of over the whole image to capture local information.
We build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks.
arXiv Detail & Related papers (2021-06-10T14:38:32Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (see the sketch after this list).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
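The following is a minimal sketch of the idea summarized in the LocalViT entry above: a transformer feed-forward network with a depth-wise convolution inserted between the two point-wise projections, applied after reshaping the token sequence back to a 2D grid. Names (LocalFeedForward, grid_size) are illustrative assumptions, not LocalViT's actual implementation.

```python
# Hedged sketch: feed-forward network with a depth-wise convolution,
# which mixes only neighbouring patch tokens and thereby adds locality.
import torch
import torch.nn as nn


class LocalFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, grid_size: int):
        super().__init__()
        self.grid_size = grid_size
        self.expand = nn.Conv2d(dim, hidden_dim, kernel_size=1)       # point-wise
        self.depthwise = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                   padding=1, groups=hidden_dim)      # depth-wise
        self.project = nn.Conv2d(hidden_dim, dim, kernel_size=1)      # point-wise
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) with N = grid_size * grid_size patch tokens
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid_size, self.grid_size)
        x = self.act(self.expand(x))
        x = self.act(self.depthwise(x))   # mixes neighbouring patches only
        x = self.project(x)
        return x.reshape(B, D, N).transpose(1, 2)


if __name__ == "__main__":
    ffn = LocalFeedForward(dim=128, hidden_dim=512, grid_size=8)
    out = ffn(torch.randn(2, 64, 128))
    print(out.shape)  # torch.Size([2, 64, 128])
```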