Local-to-Global Self-Attention in Vision Transformers
- URL: http://arxiv.org/abs/2107.04735v1
- Date: Sat, 10 Jul 2021 02:34:55 GMT
- Title: Local-to-Global Self-Attention in Vision Transformers
- Authors: Jinpeng Li, Yichao Yan, Shengcai Liao, Xiaokang Yang, Ling Shao
- Abstract summary: Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
- Score: 130.0369761612812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have demonstrated great potential in computer vision tasks. To avoid the dense computation of self-attention on high-resolution visual data, some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows. This design significantly improves efficiency but lacks global feature reasoning in the early stages. In this work, we design a multi-path structure for the Transformer that enables local-to-global reasoning at multiple granularities in each stage. The proposed framework is computationally efficient and highly effective. With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation. Code is available at https://github.com/ljpadam/LG-Transformer
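For intuition, the multi-path idea can be pictured as running window attention at several granularities in parallel and fusing the results. The following is a minimal, hypothetical PyTorch-style sketch, not the authors' released code (see the repository above for that); names such as LocalGlobalBlock, window, and num_paths are illustrative, coarser paths are simply average-pooled before attention, and the fusion is a plain concatenation plus linear layer.

```python
# Hypothetical sketch of multi-path local-to-global attention (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_attention(x, attn, window):
    """Multi-head self-attention within non-overlapping windows.
    x: (B, H, W, C); assumes H and W are divisible by `window`."""
    B, H, W, C = x.shape
    # Partition into (B * num_windows, window*window, C) token groups.
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    x, _ = attn(x, x, x)                       # self-attention inside each window
    # Reverse the partition back to (B, H, W, C).
    x = x.view(B, H // window, W // window, window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    return x

class LocalGlobalBlock(nn.Module):
    """One block with parallel paths at different granularities: path 0 attends
    within local windows at full resolution, coarser paths average-pool the
    features first so the same window covers a larger (eventually near-global)
    region, then results are upsampled and fused."""
    def __init__(self, dim, num_heads=4, window=7, num_paths=3):
        super().__init__()
        self.window = window
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_paths)])
        self.fuse = nn.Linear(num_paths * dim, dim)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        outs = []
        for k, attn in enumerate(self.attns):
            s = 2 ** k                          # downsampling factor of path k
            xk = x.permute(0, 3, 1, 2)          # to (B, C, H, W) for pooling
            if s > 1:
                xk = F.avg_pool2d(xk, s)        # coarser granularity
            xk = xk.permute(0, 2, 3, 1)         # back to (B, h, w, C)
            # Assumes H // s and W // s stay divisible by the window size.
            xk = window_attention(xk, attn, self.window)
            xk = xk.permute(0, 3, 1, 2)
            if s > 1:                           # restore full resolution
                xk = F.interpolate(xk, size=(H, W), mode='nearest')
            outs.append(xk.permute(0, 2, 3, 1))
        return x + self.fuse(torch.cat(outs, dim=-1))   # residual fusion
```

Under these assumptions, `LocalGlobalBlock(dim=96)(torch.randn(2, 56, 56, 96))` returns a tensor of the same shape, with path 0 reasoning inside 7x7 windows and the coarsest path covering most of the image.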
Related papers
- An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding [17.855998090452058]
We propose an efficient and effective multi-task visual grounding framework based on Transformer Decoder.
On the language side, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are fed in as memory and visual features serve as queries.
On the visual side, we introduce a parameter-free approach that reduces computation by eliminating background visual tokens based on attention scores.
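As a rough illustration of that arrangement, the hypothetical sketch below feeds visual tokens as decoder queries and linguistic tokens as memory, and prunes low-scoring visual tokens with a parameter-free dot-product score; class and parameter names (GroundingFusion, keep_ratio) are invented for the example, and the paper's exact scoring rule may differ.

```python
# Hypothetical sketch: decoder fusion with attention-based token pruning.
import torch
import torch.nn as nn

class GroundingFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_layers=2, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, visual, linguistic):
        # visual: (B, Nv, C) patch tokens; linguistic: (B, Nt, C) word tokens.
        # 1) Parameter-free scoring: dot-product attention from words to visual
        #    tokens, averaged over words, ranks how "foreground" each token is.
        sim = torch.einsum('btc,bvc->btv', linguistic, visual) / visual.size(-1) ** 0.5
        scores = sim.softmax(dim=-1).mean(dim=1)                  # (B, Nv)
        k = max(1, int(visual.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=-1).indices                      # kept token indices
        kept = torch.gather(visual, 1,
                            idx.unsqueeze(-1).expand(-1, -1, visual.size(-1)))
        # 2) Fusion: visual tokens act as decoder queries (tgt), linguistic
        #    features are attended to as memory.
        return self.decoder(tgt=kept, memory=linguistic)
```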
arXiv Detail & Related papers (2024-08-02T09:01:05Z)
- SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations [75.71298846760303]
We show that a single attention layer can deliver surprisingly competitive performance across node property prediction benchmarks.
We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model.
We believe the proposed methodology alone enlightens a new technical path of independent interest for building Transformers on large graphs.
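The "one-layer attention" claim can be pictured with a deliberately simple model: embed node features, apply a single all-pair attention layer, and classify each node. The sketch below is hypothetical and uses standard dense attention for clarity; SGFormer's own attention is a simplified, linear-complexity variant and is combined with graph structure, which this example omits.

```python
# Hypothetical sketch of one-layer global attention over node features.
import torch
import torch.nn as nn

class OneLayerNodeAttention(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, num_heads=1):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, node_feats):
        # node_feats: (N, in_dim) features for all N nodes of one graph.
        h = self.embed(node_feats).unsqueeze(0)   # (1, N, hidden): one "batch"
        h, _ = self.attn(h, h, h)                 # single all-pair attention layer
        return self.head(h.squeeze(0))            # (N, num_classes) logits
```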
arXiv Detail & Related papers (2023-06-19T08:03:25Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
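The window-to-quadrangle step can be approximated as regressing a small transform per default window and re-sampling features along the warped grid before attention. The hypothetical sketch below restricts the warp to an affine transform applied inside each window crop for brevity (the paper's quadrangles are more general and can reach outside the default window); names such as QuadrangleSampler are illustrative.

```python
# Hypothetical sketch of window-to-quadrangle feature re-sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleSampler(nn.Module):
    """Regress one affine transform per default window and re-sample that
    window's features along the warped grid (a simplified, window-local
    stand-in for the quadrangle sampling step)."""
    def __init__(self, dim, window=7):
        super().__init__()
        self.window = window
        self.head = nn.Linear(dim, 6)            # 6 affine parameters per window
        nn.init.zeros_(self.head.weight)         # start from the identity warp,
        with torch.no_grad():                    # i.e. plain rectangular windows
            self.head.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        # x: (B, C, H, W), with H and W divisible by the window size.
        B, C, H, W = x.shape
        w = self.window
        nh, nw = H // w, W // w
        # One pooled descriptor per default window -> one transform per window.
        desc = F.adaptive_avg_pool2d(x, (nh, nw)).permute(0, 2, 3, 1)  # (B,nh,nw,C)
        theta = self.head(desc).reshape(B * nh * nw, 2, 3)
        # Normalized sampling grid for each warped window.
        grid = F.affine_grid(theta, (B * nh * nw, C, w, w), align_corners=False)
        # Treat each default window as its own small image and re-sample it.
        wins = x.view(B, C, nh, w, nw, w).permute(0, 2, 4, 1, 3, 5)
        wins = wins.reshape(B * nh * nw, C, w, w)
        return F.grid_sample(wins, grid, align_corners=False)   # warped windows
```

The identity initialization means training starts from ordinary rectangular windows, with the warp learned end-to-end on top.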
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG).
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain a global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
The Swin Transformer set new records on various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show that the proposed SSformer yields mIoU performance comparable to state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and fine-grained local spatial features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [129.10351459066501]
We propose a specialized token for each region that serves as a messenger (MSG).
By manipulating these MSG tokens, one can flexibly exchange visual information across regions.
We then integrate the MSG token into a multi-scale architecture named MSG-Transformer.
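One way to picture the messenger mechanism: each window carries one extra MSG token through its window attention, and the MSG tokens are then exchanged across windows between blocks. The sketch below is hypothetical; the channel-group roll used to "manipulate" the tokens is a simple stand-in, whereas the paper's shuffle operates within local groups of windows and differs in detail.

```python
# Hypothetical sketch of window attention with exchangeable MSG tokens.
import torch
import torch.nn as nn

class MSGWindowBlock(nn.Module):
    def __init__(self, dim, num_heads=4, shuffle_groups=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.shuffle_groups = shuffle_groups

    def forward(self, windows, msg):
        # windows: (B, num_windows, tokens_per_window, C) patch tokens
        # msg:     (B, num_windows, C), one messenger token per window
        B, nW, T, C = windows.shape
        # 1) Window attention over [MSG token ; patch tokens] inside each window.
        x = torch.cat([msg.unsqueeze(2), windows], dim=2).reshape(B * nW, T + 1, C)
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, nW, T + 1, C)
        msg, windows = x[:, :, 0], x[:, :, 1:]
        # 2) Exchange information across windows: split MSG channels into groups
        #    and roll each group along the window axis, so every MSG token now
        #    carries channel groups that originated in other windows.
        chunks = msg.chunk(self.shuffle_groups, dim=-1)
        msg = torch.cat([torch.roll(c, shifts=j, dims=1)
                         for j, c in enumerate(chunks)], dim=-1)
        return windows, msg
```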
arXiv Detail & Related papers (2021-05-31T17:16:42Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
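Concretely, the change amounts to reshaping the token sequence back into a 2D map inside the feed-forward network and applying a 3x3 depth-wise convolution between the two linear layers, mirroring an inverted residual block. The sketch below is a minimal illustration (the class name LocalityFFN and the hw argument are assumptions, and a class token, if present, must be handled separately).

```python
# Hypothetical sketch of a locality-enhanced feed-forward network.
import torch
import torch.nn as nn

class LocalityFFN(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)        # expand (1x1 conv equivalent)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depth-wise 3x3
        self.fc2 = nn.Linear(hidden_dim, dim)        # project back
        self.act = nn.GELU()

    def forward(self, x, hw):
        # x: (B, N, C) token sequence (no class token); hw = (H, W) with N = H*W.
        B, N, C = x.shape
        H, W = hw
        x = self.act(self.fc1(x))                    # (B, N, hidden)
        # Reshape tokens to a 2D map so the depth-wise conv sees neighbours.
        x = x.transpose(1, 2).reshape(B, -1, H, W)
        x = self.act(self.dwconv(x))
        x = x.reshape(B, -1, N).transpose(1, 2)      # back to (B, N, hidden)
        return self.fc2(x)
```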
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.