Locally Shifted Attention With Early Global Integration
- URL: http://arxiv.org/abs/2112.05080v1
- Date: Thu, 9 Dec 2021 18:12:24 GMT
- Title: Locally Shifted Attention With Early Global Integration
- Authors: Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf
- Abstract summary: We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
- Score: 93.5766619842226
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent work has shown the potential of transformers for computer vision
applications. An image is first partitioned into patches, which are then used
as input tokens for the attention mechanism. Due to the expensive quadratic
cost of the attention mechanism, either a large patch size is used, resulting
in coarse-grained global interactions, or alternatively, attention is applied
only on a local region of the image, at the expense of long-range interactions.
In this work, we propose an approach that allows for both coarse global
interactions and fine-grained local interactions already at early layers of a
vision transformer.
At the core of our method is the application of local and global attention
layers. In the local attention layer, we apply attention to each patch and its
local shifts, resulting in virtually located local patches, which are not bound
to a single, specific location. These virtually located patches are then used
in a global attention layer. The separation of the attention layer into local
and global counterparts allows for a low computational cost in the number of
patches, while still supporting data-dependent localization already at the
first layer, as opposed to the static positioning in other visual transformers.
Our method is shown to be superior to both convolutional and transformer-based
methods for image classification on CIFAR10, CIFAR100, and ImageNet. Code is
available at: https://github.com/shellysheynin/Locally-SAG-Transformer.
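Below is a minimal sketch of the two-stage attention described in the abstract: local attention over each patch and its spatial shifts, producing one "virtually located" token per patch, followed by global attention over those tokens. It is not the authors' implementation (see the repository linked above); the module names, the query convention, and the 3x3 shift neighbourhood are assumptions for illustration.

```python
# Hedged sketch: local attention over shifted patch copies, then global
# attention over the resulting "virtually located" patch tokens.
# Names (LocalShiftAttention, GlobalAttention, num_shifts) are illustrative.
import torch
import torch.nn as nn


class LocalShiftAttention(nn.Module):
    """Attend from each patch token to its locally shifted copies.

    Input:  shifted patch embeddings of shape (B, N, S, D), where N is the
            number of patches and S the number of local shifts (the un-shifted
            patch included).
    Output: one "virtually located" token per patch, shape (B, N, D).
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shifted_tokens: torch.Tensor) -> torch.Tensor:
        B, N, S, D = shifted_tokens.shape
        x = shifted_tokens.reshape(B * N, S, D)
        # The un-shifted patch (index 0 by convention here) queries its shifts;
        # the attention weights decide where the patch is "virtually located",
        # so the localization is data-dependent rather than static.
        query = x[:, :1, :]                      # (B*N, 1, D)
        out, _ = self.attn(query, x, x)          # (B*N, 1, D)
        return out.reshape(B, N, D)


class GlobalAttention(nn.Module):
    """Standard self-attention over the N virtually located patch tokens.

    Cost is quadratic only in N, not in N * S, which is the point of
    separating the layer into local and global counterparts.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(tokens, tokens, tokens)
        return out


if __name__ == "__main__":
    B, N, S, D = 2, 64, 9, 128   # 9 shifts = assumed 3x3 neighbourhood
    shifted = torch.randn(B, N, S, D)
    virtual_patches = LocalShiftAttention(D)(shifted)   # (2, 64, 128)
    out = GlobalAttention(D)(virtual_patches)           # (2, 64, 128)
    print(out.shape)
```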
Related papers
- LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching [8.503217766507584]
A novel convolutional transformer is proposed to capture both local contexts and global structures.
A universal FPN-like framework uses transformers to capture global structures in both the self-encoder and the cross-decoder.
A novel regression-based sub-pixel refinement module exploits the whole fine-grained window features for fine-level positional deviation regression.
arXiv Detail & Related papers (2023-11-29T12:06:19Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- TransVPR: Transformer-based place recognition with multi-level attention aggregation [9.087163485833058]
We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
arXiv Detail & Related papers (2022-01-06T10:20:24Z)
- Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Because existing methods pay little attention to content changes, semantic information from the source images degrades during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- CAT: Cross Attention in Vision Transformer [39.862909079452294]
We propose a new attention mechanism in Transformer called Cross Attention.
It alternates attention within the image patch instead of over the whole image to capture local information.
We build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks.
arXiv Detail & Related papers (2021-06-10T14:38:32Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (see the sketch after this list).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
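The following is a minimal sketch of the idea summarized in the LocalViT entry above: a transformer feed-forward network with a depth-wise convolution inserted between the two point-wise projections, applied after reshaping the token sequence back to a 2D grid. Names (LocalFeedForward, grid_size) are illustrative assumptions, not LocalViT's actual implementation.

```python
# Hedged sketch: feed-forward network with a depth-wise convolution,
# which mixes only neighbouring patch tokens and thereby adds locality.
import torch
import torch.nn as nn


class LocalFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, grid_size: int):
        super().__init__()
        self.grid_size = grid_size
        self.expand = nn.Conv2d(dim, hidden_dim, kernel_size=1)       # point-wise
        self.depthwise = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                   padding=1, groups=hidden_dim)      # depth-wise
        self.project = nn.Conv2d(hidden_dim, dim, kernel_size=1)      # point-wise
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) with N = grid_size * grid_size patch tokens
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid_size, self.grid_size)
        x = self.act(self.expand(x))
        x = self.act(self.depthwise(x))   # mixes neighbouring patches only
        x = self.project(x)
        return x.reshape(B, D, N).transpose(1, 2)


if __name__ == "__main__":
    ffn = LocalFeedForward(dim=128, hidden_dim=512, grid_size=8)
    out = ffn(torch.randn(2, 64, 128))
    print(out.shape)  # torch.Size([2, 64, 128])
```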