Semantic-Aware Local-Global Vision Transformer
- URL: http://arxiv.org/abs/2211.14705v1
- Date: Sun, 27 Nov 2022 03:16:00 GMT
- Title: Semantic-Aware Local-Global Vision Transformer
- Authors: Jiatong Zhang, Zengwei Yao, Fanglin Chen, Guangming Lu, and Wenjie Pei
- Abstract summary: We propose the Semantic-Aware Local-Global Vision Transformer (SALG).
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain the global view when learning features for each token.
- Score: 24.55333039729068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have achieved remarkable progress, among which Swin
Transformer has demonstrated the tremendous potential of Transformer for vision
tasks. It surmounts the key challenge of high computational complexity by
performing local self-attention within shifted windows. In this work we propose
the Semantic-Aware Local-Global Vision Transformer (SALG), to further
investigate two potential improvements to Swin Transformer. First, unlike
Swin Transformer, which performs uniform partitioning to produce equal-sized
regular windows for local self-attention, our SALG performs semantic
segmentation in an unsupervised way to explore the underlying semantic priors
in the image. As a result, each segmented region can correspond to a
semantically meaningful part in the image, potentially leading to more
effective features within each of the segmented regions. Second, instead of only
performing local self-attention within local windows as Swin Transformer does,
the proposed SALG performs both 1) local intra-region self-attention for
learning fine-grained features within each region and 2) global inter-region
feature propagation for modeling global dependencies among all regions.
Consequently, our model is able to obtain the global view when learning
features for each token, which is the essential advantage of the Transformer. Owing
to the explicit modeling of the semantic priors and the proposed local-global
modeling mechanism, our SALG is particularly advantageous for small-scale
models when the modeling capacity is not sufficient for other models to learn
semantics implicitly. Extensive experiments across various vision tasks
demonstrate the merit of our model over other vision Transformers, especially
in small-scale modeling scenarios.
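To make the local-global mechanism easier to follow, below is a minimal PyTorch sketch of the two-step attention pattern the abstract describes: intra-region self-attention restricted to tokens that share a region label, followed by inter-region propagation over pooled region descriptors. This is an illustrative sketch, not the authors' implementation: the `LocalGlobalBlock` name, the mean-pooled region descriptors, and the assumption that region labels are supplied externally (SALG obtains them from unsupervised semantic segmentation) are all assumptions made here.

```python
# Minimal sketch of local intra-region attention + global inter-region propagation.
# Region labels are assumed to be given; SALG would derive them from its
# unsupervised semantic segmentation step.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, region_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token features; region_ids: (B, N) integer region labels.
        B, N, C = x.shape

        # 1) Local intra-region self-attention: block attention between tokens
        #    that belong to different regions.
        same_region = region_ids.unsqueeze(2) == region_ids.unsqueeze(1)  # (B, N, N)
        # MultiheadAttention expects one mask per head: (B * num_heads, N, N), True = blocked.
        attn_mask = ~same_region.repeat_interleave(self.local_attn.num_heads, dim=0)
        h = self.norm1(x)
        local_out, _ = self.local_attn(h, h, h, attn_mask=attn_mask)
        x = x + local_out

        # 2) Global inter-region propagation: mean-pool each region into a descriptor,
        #    let the descriptors attend to each other, then broadcast the update back.
        R = int(region_ids.max().item()) + 1
        one_hot = torch.nn.functional.one_hot(region_ids, R).float()     # (B, N, R)
        counts = one_hot.sum(dim=1).clamp(min=1.0)                        # (B, R)
        region_feat = one_hot.transpose(1, 2) @ x / counts.unsqueeze(-1)  # (B, R, C)
        g = self.norm2(region_feat)
        global_out, _ = self.global_attn(g, g, g)
        x = x + one_hot @ global_out                                      # back to (B, N, C)
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 49, 96)                 # e.g. a 7x7 grid of 96-d tokens
    regions = torch.randint(0, 4, (2, 49))          # 4 hypothetical semantic regions
    block = LocalGlobalBlock(dim=96, num_heads=4)
    print(block(tokens, regions).shape)             # torch.Size([2, 49, 96])
```

Restricting attention with a boolean mask keeps the intra-region step exact even for irregular, semantically defined regions, which is the main departure from Swin's fixed rectangular windows; the pooled-descriptor step is one simple way to realize the global view the abstract refers to.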
Related papers
- Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves localization accuracy that is 0.98% mAP higher, outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z) - Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - Full Contextual Attention for Multi-resolution Transformers in Semantic
Segmentation [76.93387214103863]
This paper extends the notion of global tokens to build GLobal Attention Multi-resolution (GLAM) transformers.
GLAM includes learnable global tokens, which unlike previous methods can model interactions between all image regions.
Experiments show that GLAM-Swin and GLAM-Swin-UNet exhibit substantially better performance than their vanilla counterparts on ADE20K and Cityscapes.
arXiv Detail & Related papers (2022-12-15T15:19:09Z) - MAFormer: A Transformer Network with Multi-scale Attention Fusion for
Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into a transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
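Since the LocalViT entry above describes a concrete mechanism, here is a rough PyTorch sketch of it under stated assumptions: a feed-forward network whose hidden activations are reshaped back onto the token grid and passed through a depth-wise 3x3 convolution so that neighbouring tokens mix locally. The `ConvFFN` name and the exact ordering of activation and convolution are assumptions, not the paper's reference code.

```python
# Sketch of a transformer feed-forward network with a depth-wise convolution
# inserted between the two linear layers, as described in the LocalViT entry.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # groups=hidden_dim makes the 3x3 convolution depth-wise (one filter per channel).
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) with N == H * W spatial tokens.
        B, N, C = x.shape
        h = self.act(self.fc1(x))                      # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)     # to (B, hidden, H, W)
        h = self.act(self.dwconv(h))                   # local mixing of neighbouring tokens
        h = h.reshape(B, -1, N).transpose(1, 2)        # back to (B, N, hidden)
        return self.fc2(h)


if __name__ == "__main__":
    ffn = ConvFFN(dim=96, hidden_dim=384)
    tokens = torch.randn(2, 14 * 14, 96)               # a 14x14 token grid
    print(ffn(tokens, H=14, W=14).shape)                # torch.Size([2, 196, 96])
```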
This list is automatically generated from the titles and abstracts of the papers in this site.