MSG-Transformer: Exchanging Local Spatial Information by Manipulating
Messenger Tokens
- URL: http://arxiv.org/abs/2105.15168v1
- Date: Mon, 31 May 2021 17:16:42 GMT
- Title: MSG-Transformer: Exchanging Local Spatial Information by Manipulating
Messenger Tokens
- Authors: Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, Qi
Tian
- Abstract summary: We propose a specialized token for each region that serves as a messenger (MSG).
By manipulating these MSG tokens, one can flexibly exchange visual information across regions.
We then integrate the MSG token into a multi-scale architecture named MSG-Transformer.
- Score: 129.10351459066501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have offered a new methodology for designing neural networks for
visual recognition. Compared to convolutional networks, Transformers can refer to global
features at every stage, yet the attention module incurs a higher computational overhead
that obstructs applying Transformers to high-resolution visual data. This paper aims to
alleviate the conflict between efficiency and flexibility, for which we propose a
specialized token for each region that serves as a messenger (MSG). By manipulating these
MSG tokens, one can flexibly exchange visual information across regions while reducing
the computational complexity. We then integrate the MSG token into a multi-scale
architecture named MSG-Transformer. On standard image classification and object
detection, MSG-Transformer achieves competitive performance and accelerates inference on
both GPU and CPU. The code will be available at https://github.com/hustvl/MSG-Transformer.
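The abstract describes the MSG mechanism only at a high level. Below is a minimal sketch of how such messenger tokens could work, assuming PyTorch; the module and function names (`WindowAttentionWithMSG`, `shuffle_msg`), the single shuffle region, and the channel-shuffle exchange are illustrative assumptions rather than the paper's actual implementation, which is linked above.

```python
# Minimal sketch (not the authors' code) of the MSG-token idea: every local
# window carries one extra "messenger" token that joins that window's
# self-attention, and the MSG tokens are then shuffled across the windows of a
# region so local features can be exchanged without global attention.
import torch
import torch.nn as nn


class WindowAttentionWithMSG(nn.Module):
    """Self-attention over [MSG token ; window tokens], run per window."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, win_tokens: torch.Tensor, msg_tokens: torch.Tensor):
        # win_tokens: (B * num_windows, window_size**2, dim)
        # msg_tokens: (B * num_windows, 1, dim)
        x = torch.cat([msg_tokens, win_tokens], dim=1)  # prepend the MSG token
        x, _ = self.attn(x, x, x)                       # attention stays local
        return x[:, 1:, :], x[:, :1, :]                 # updated window / MSG tokens


def shuffle_msg(msg: torch.Tensor) -> torch.Tensor:
    """Exchange channel groups of MSG tokens across the windows of one region.

    msg: (B, W, D) for W windows; channels are split into W groups, and group j
    of window i is swapped with group i of window j (a cross-window channel shuffle).
    """
    B, W, D = msg.shape
    assert D % W == 0, "channel dim must split evenly across the windows"
    x = msg.view(B, W, W, D // W)            # (batch, window, group, channels/group)
    return x.transpose(1, 2).reshape(B, W, D)


if __name__ == "__main__":
    B, W, M, D = 2, 4, 49, 96                # batch, windows per region, tokens/window, dim
    win = torch.randn(B * W, M, D)
    msg = torch.randn(B * W, 1, D)
    block = WindowAttentionWithMSG(D)
    win, msg = block(win, msg)               # local attention within each window
    msg = shuffle_msg(msg.reshape(B, W, D))  # cross-window information exchange
    print(win.shape, msg.shape)              # torch.Size([8, 49, 96]) torch.Size([2, 4, 96])
```

In such a scheme, attention is computed only over each window's tokens plus one MSG token, so the cost grows linearly with the number of windows rather than quadratically with the total token count, while the shuffle step lets regions exchange information.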
Related papers
- GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation [0.9249657468385781]
This paper proposes a collaborative learning scheme for GNN-Transformer and constructs the GTC architecture.
For the Transformer branch, we propose Metapath-aware Hop2Token and CG-Hetphormer, which can cooperate with GNN to attentively encode neighborhood information from different levels.
Experiments on real datasets show that GTC exhibits superior performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-03-22T12:22:44Z)
- SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations [75.71298846760303]
We show that one-layer attention can deliver surprisingly competitive performance across node property prediction benchmarks.
We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model.
We believe the proposed methodology alone enlightens a new technical path of independent interest for building Transformers on large graphs.
arXiv Detail & Related papers (2023-06-19T08:03:25Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation [76.93387214103863]
This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers.
GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions.
Experiments show that GLAM-Swin and GLAM-Swin-UNet exhibit substantially better performance than their vanilla counterparts on ADE20K and Cityscapes.
arXiv Detail & Related papers (2022-12-15T15:19:09Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attention of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: encoder-decoder-based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features while reducing the computational complexity of the standard vision transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.