SG-Former: Self-guided Transformer with Evolving Token Reallocation
- URL: http://arxiv.org/abs/2308.12216v1
- Date: Wed, 23 Aug 2023 15:52:45 GMT
- Title: SG-Former: Self-guided Transformer with Evolving Token Reallocation
- Authors: Sucheng Ren, Xingyi Yang, Songhua Liu, Xinchao Wang
- Abstract summary: We propose a novel model, termed Self-guided Transformer (SG-Former), towards effective global self-attention with adaptive fine granularity.
We assign more tokens to salient regions to achieve fine-grained attention, while allocating fewer tokens to minor regions in exchange for efficiency and global receptive fields.
The proposed SG-Former achieves performance superior to the state of the art: our base-size model achieves 84.7% Top-1 accuracy on ImageNet-1K, 51.2 box mAP on COCO, and 52.7 mIoU on ADE20K.
- Score: 89.9363449724261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer has demonstrated impressive success across various vision
tasks. However, its heavy computation cost, which grows quadratically with
respect to the token sequence length, largely limits its power in handling
large feature maps. To alleviate the computation cost, previous works rely on
either fine-grained self-attention restricted to small local regions, or
global self-attention over a shortened sequence, which results in coarse
granularity. In this paper, we propose a novel model, termed Self-guided
Transformer (SG-Former), towards effective global self-attention with adaptive
fine granularity. At the heart of our approach is a significance map, which is
estimated through hybrid-scale self-attention and evolves during training, and
which is used to reallocate tokens based on the significance of each region.
Intuitively, we assign more tokens to salient regions to achieve fine-grained
attention, while allocating fewer tokens to minor regions in exchange for
efficiency and global receptive fields. The proposed SG-Former achieves
performance superior to the state of the art: our base-size model achieves
84.7% Top-1 accuracy on ImageNet-1K, 51.2 box mAP on COCO, and 52.7 mIoU on
ADE20K, surpassing the Swin Transformer by +1.3% / +2.7 mAP / +3 mIoU, with
lower computation costs and fewer parameters. The code is available at
https://github.com/OliverRensu/SG-Former
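The reallocation idea can be pictured with a short sketch. The following is a minimal, illustrative PyTorch sketch of significance-guided token reallocation, not the official SG-Former implementation: the uniform 4x4 region grid, the proportional token-budget rule, and the hand-made significance map are assumptions made for clarity, whereas the paper estimates the map through hybrid-scale self-attention and lets it evolve during training.

```python
# Minimal sketch of significance-guided token reallocation.
# NOT the official SG-Former code: the 4x4 region grid, the budget rule,
# and the hand-made significance map below are illustrative assumptions.
import math
import torch
import torch.nn.functional as F


def reallocate_tokens(feat, significance, grid=4, token_budget=64):
    """Pool each region of `feat` to a resolution proportional to its significance.

    feat:         (C, H, W) feature map
    significance: (H, W) non-negative importance scores
    Returns an (N, C) token matrix in which salient regions contribute more tokens.
    """
    C, H, W = feat.shape
    rh, rw = H // grid, W // grid
    # Per-region significance, normalised to a distribution over regions.
    region_sig = F.adaptive_avg_pool2d(significance[None, None], grid).flatten()
    region_sig = region_sig / region_sig.sum().clamp_min(1e-6)
    tokens = []
    for idx in range(grid * grid):
        i, j = divmod(idx, grid)
        patch = feat[:, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
        # More tokens for salient regions, at least one token per region.
        n = max(1, int(round(region_sig[idx].item() * token_budget)))
        side = max(1, int(round(math.sqrt(n))))
        pooled = F.adaptive_avg_pool2d(patch[None], side)[0]  # (C, side, side)
        tokens.append(pooled.flatten(1).t())                  # (side*side, C)
    return torch.cat(tokens, dim=0)


# Usage: a 56x56 feature map whose centre is marked as salient.
feat = torch.randn(96, 56, 56)
sig = torch.zeros(56, 56)
sig[20:36, 20:36] = 1.0          # salient centre block
kv = reallocate_tokens(feat, sig)
print(kv.shape)                  # torch.Size([76, 96]); the four centre regions give 64 of the 76 tokens
```

In the paper the reallocated tokens serve as keys and values, so every query still enjoys a global receptive field while salient regions retain fine-grained detail.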
Related papers
- SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations [75.71298846760303]
We show that a one-layer attention can deliver surprisingly competitive performance across node property prediction benchmarks.
We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model.
We believe the proposed methodology alone enlightens a new technical path of independent interest for building Transformers on large graphs.
arXiv Detail & Related papers (2023-06-19T08:03:25Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone.
In addition, we design a STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformer has achieved impressive performance for many vision tasks.
However, it may suffer from high redundancy when capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to the global self-attention, PS-Attention can reduce the computation and memory costs significantly.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with the model size of 22M, 48M, and 85M respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
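Several of the entries above share one mechanism with SG-Former's significance estimation: hybrid-scale attention, in which different head groups see keys and values downsampled at different rates (Shunted Self-Attention is the clearest example). Below is a minimal, illustrative PyTorch sketch of this idea under my own assumptions (one pooling rate per head, plain average pooling, and the hypothetical class name HybridScaleAttention); it is not the code of any paper listed here.

```python
# Minimal sketch of hybrid-scale (shunted-style) self-attention.
# NOT code from any listed paper: per-head pooling rates, average pooling,
# and the class name are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridScaleAttention(nn.Module):
    def __init__(self, dim=96, heads=4, rates=(1, 2, 4, 8)):
        # One downsampling rate per head: rate 1 keeps fine-grained K/V,
        # larger rates shorten the K/V sequence for a cheaper, coarser view.
        super().__init__()
        assert dim % heads == 0 and len(rates) == heads
        self.heads, self.rates, self.head_dim = heads, rates, dim // heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        # x: (B, N, C) tokens from an (H, W) feature map, N = H * W.
        B, N, C = x.shape
        H, W = hw
        q = self.q(x).view(B, N, self.heads, self.head_dim).transpose(1, 2)
        out = []
        for h, r in enumerate(self.rates):
            kv_in = x
            if r > 1:
                # Pool tokens spatially by rate r before projecting K and V.
                grid = x.transpose(1, 2).reshape(B, C, H, W)
                grid = F.avg_pool2d(grid, kernel_size=r, stride=r)
                kv_in = grid.flatten(2).transpose(1, 2)      # (B, N / r^2, C)
            kv = self.kv(kv_in).view(B, -1, 2, self.heads, self.head_dim)
            k, v = kv[:, :, 0, h], kv[:, :, 1, h]            # (B, M, head_dim) each
            attn = (q[:, h] @ k.transpose(1, 2)) / self.head_dim ** 0.5
            out.append(attn.softmax(dim=-1) @ v)             # (B, N, head_dim)
        return self.proj(torch.cat(out, dim=-1))


# Usage: a 56x56 feature map with 96 channels.
x = torch.randn(2, 56 * 56, 96)
print(HybridScaleAttention()(x, (56, 56)).shape)             # torch.Size([2, 3136, 96])
```

Coarse heads provide a cheap global view while the rate-1 head preserves fine detail; SG-Former's abstract describes estimating its significance map from such hybrid-scale self-attention and then using that map to drive token reallocation.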