Full Contextual Attention for Multi-resolution Transformers in Semantic
Segmentation
- URL: http://arxiv.org/abs/2212.07890v1
- Date: Thu, 15 Dec 2022 15:19:09 GMT
- Title: Full Contextual Attention for Multi-resolution Transformers in Semantic
Segmentation
- Authors: Loic Themyr, Clement Rambour, Nicolas Thome, Toby Collins, Alexandre
Hostettler
- Abstract summary: This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers.
GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions.
Experiments show GLAM-Swin and GLAM-Swin-UNet exhibit substantially better performance than their vanilla counterparts on ADE20K and Cityscapes.
- Score: 76.93387214103863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have proved to be very effective for visual recognition tasks.
In particular, vision transformers construct compressed global representations
through self-attention and learnable class tokens. Multi-resolution
transformers have shown recent successes in semantic segmentation but can only
capture local interactions in high-resolution feature maps. This paper extends
the notion of global tokens to build GLobal Attention Multi-resolution (GLAM)
transformers. GLAM is a generic module that can be integrated into most
existing transformer backbones. GLAM includes learnable global tokens, which
unlike previous methods can model interactions between all image regions, and
extract powerful representations during training. Extensive experiments show
that GLAM-Swin and GLAM-Swin-UNet exhibit substantially better performance than
their vanilla counterparts on ADE20K and Cityscapes. Moreover, GLAM can be used
to segment large 3D medical images, and GLAM-nnFormer achieves new
state-of-the-art performance on the BCV dataset.
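To make the mechanism concrete, the sketch below illustrates the global-token idea from the abstract in PyTorch: a small set of learnable tokens is shared across all windows and joined to each window before self-attention, so every region interacts with a representation that has seen the whole image. The module name, the pooling of the global tokens, and the hyperparameters are illustrative assumptions, not the authors' GLAM code.

```python
import torch
import torch.nn as nn


class GlobalTokenWindowAttention(nn.Module):
    """Minimal sketch of window attention augmented with learnable global tokens.

    Illustrates the general idea summarized in the abstract: a small set of
    learnable tokens is shared by every window, so information can flow between
    all image regions even though attention itself stays window-local. This is
    an assumption-laden sketch, not the authors' implementation.
    """

    def __init__(self, dim: int, num_heads: int = 4, num_global: int = 8):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_global = num_global

    def forward(self, windows: torch.Tensor):
        # windows: (num_windows, tokens_per_window, dim) for a single image
        num_windows = windows.shape[0]
        g = self.global_tokens.expand(num_windows, -1, -1)
        x = torch.cat([g, windows], dim=1)   # prepend the shared global tokens
        x, _ = self.attn(x, x, x)            # joint attention over global + local tokens
        g_out, local_out = x[:, :self.num_global], x[:, self.num_global:]
        # Pool the per-window copies so the global tokens carry information
        # from all windows into the next block.
        return local_out, g_out.mean(dim=0, keepdim=True)
```

In a multi-resolution backbone such as Swin, a block of this kind would sit alongside the window-attention layers at each stage; per the abstract, GLAM is designed as a generic module that can be integrated into such backbones.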
Related papers
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
- Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification [16.19677745296922]
We propose a novel Multi-scale Efficient Graph-Transformer (MEGT) framework for WSI classification.
The key idea of MEGT is to adopt two independent Efficient Graph-based Transformer (EGT) branches to process the low-resolution and high-resolution patch embeddings.
We propose a novel MFFM to alleviate the semantic gap among different resolution patches during feature fusion.
arXiv Detail & Related papers (2023-05-25T06:34:14Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG).
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain a global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer architecture (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attention of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [129.10351459066501]
We propose a specialized token for each region that serves as a messenger (MSG); see the sketch after this list.
By manipulating these MSG tokens, one can flexibly exchange visual information across regions.
We then integrate the MSG token into a multi-scale architecture named MSG-Transformer.
arXiv Detail & Related papers (2021-05-31T17:16:42Z)
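The messenger-token mechanism in the MSG-Transformer entry above pursues the same goal as GLAM's global tokens: letting window-local attention exchange information across regions. Below is a minimal, hypothetical PyTorch sketch of that exchange; the module name and the mean-plus-linear mixing step are assumptions standing in for the paper's token-shuffling operation.

```python
import torch
import torch.nn as nn


class MessengerTokenBlock(nn.Module):
    """Rough sketch of the messenger-token idea from the MSG-Transformer entry.

    Each window carries one extra "messenger" token; attention runs inside each
    window, and only the messengers are then mixed across windows so regional
    information can propagate globally. The mixing step here (mean + linear) is
    a placeholder for the paper's token-shuffling operation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mix = nn.Linear(dim, dim)

    def forward(self, windows: torch.Tensor, msg: torch.Tensor):
        # windows: (num_windows, tokens_per_window, dim)
        # msg:     (num_windows, 1, dim), one messenger token per window
        x = torch.cat([msg, windows], dim=1)
        x, _ = self.local_attn(x, x, x)      # attention stays within each window
        msg, tokens = x[:, :1], x[:, 1:]
        # Cross-window exchange: every messenger receives a mixed summary of all
        # messengers, which the local tokens can read in the next block.
        msg = msg + self.mix(msg.mean(dim=0, keepdim=True))
        return tokens, msg
```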