Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped
Attention
- URL: http://arxiv.org/abs/2112.14000v1
- Date: Tue, 28 Dec 2021 05:37:24 GMT
- Title: Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped
Attention
- Authors: Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo
- Abstract summary: We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to global self-attention, PS-Attention can significantly reduce computation and memory costs.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M parameters, respectively.
- Score: 28.44439386445018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Transformers have shown promising performance in various vision
tasks. To reduce the quadratic computation complexity of global
self-attention, various methods constrain the range of attention to a local
region to improve efficiency. Consequently, their receptive fields in a
single attention layer are not large enough, resulting in insufficient
context modeling. To address this issue, we propose Pale-Shaped
self-Attention (PS-Attention), which performs self-attention within a
pale-shaped region. Compared to global self-attention, PS-Attention
significantly reduces computation and memory costs. Meanwhile, it captures
richer contextual information at a computation complexity similar to that of
previous local self-attention mechanisms. Based on PS-Attention, we develop
a general Vision Transformer backbone with a hierarchical architecture,
named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1
accuracy with model sizes of 22M, 48M, and 85M parameters, respectively, for
224x224 ImageNet-1K classification, outperforming previous Vision
Transformer backbones. For
downstream tasks, our Pale Transformer backbone performs better than the recent
state-of-the-art CSWin Transformer by a large margin on ADE20K semantic
segmentation and COCO object detection & instance segmentation. The code will
be released on https://github.com/BR-IDL/PaddleViT.
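For intuition, the sketch below shows single-head self-attention restricted to one pale-shaped region, taken here as the union of a few interlaced rows and columns of the token grid. This is a simplified illustration under stated assumptions, not the released PaddleViT implementation: the helper names (pale_indices, pale_self_attention), the evenly spaced row/column choice, and the identity query/key/value projections are assumptions made for brevity, whereas the actual PS-Attention tiles the whole feature map with multiple pale regions and uses learned multi-head projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pale_indices(H, W, num_rows, num_cols):
    # One pale-shaped region: the union of a few interlaced rows and
    # columns of the H x W token grid (evenly spaced here for simplicity).
    rows = np.linspace(0, H - 1, num_rows, dtype=int)
    cols = np.linspace(0, W - 1, num_cols, dtype=int)
    mask = np.zeros((H, W), dtype=bool)
    mask[rows, :] = True
    mask[:, cols] = True
    return np.flatnonzero(mask)

def pale_self_attention(x, num_rows=2, num_cols=2):
    # Single-head self-attention restricted to one pale-shaped region.
    # x: (H, W, C) feature map; tokens outside the region pass through.
    H, W, C = x.shape
    tokens = x.reshape(H * W, C)
    idx = pale_indices(H, W, num_rows, num_cols)
    region = tokens[idx]                     # (N, C) tokens in the pale
    # Identity q/k/v projections keep the sketch short; a real block uses
    # learned multi-head projections, positional information, etc.
    q, k, v = region, region, region
    attn = softmax(q @ k.T / np.sqrt(C))     # (N, N) attention weights
    out = tokens.copy()
    out[idx] = attn @ v
    return out.reshape(H, W, C)

x = np.random.randn(8, 8, 16).astype(np.float32)
y = pale_self_attention(x, num_rows=2, num_cols=2)
print(y.shape)  # (8, 8, 16)
```

Because attention here runs over roughly num_rows*W + num_cols*H tokens rather than all H*W tokens, its cost grows far more slowly with resolution than global self-attention, which is the intuition behind the claimed savings while still mixing information across entire rows and columns.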
Related papers
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)