CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows
- URL: http://arxiv.org/abs/2107.00652v1
- Date: Thu, 1 Jul 2021 17:59:56 GMT
- Title: CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows
- Authors: Xiaoyi Dong and Jianmin Bao and Dongdong Chen and Weiming Zhang and
Nenghai Yu and Lu Yuan and Dong Chen and Baining Guo
- Abstract summary: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
- Score: 99.36226415086243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present CSWin Transformer, an efficient and effective Transformer-based
backbone for general-purpose vision tasks. A challenging issue in Transformer
design is that global self-attention is very expensive to compute whereas local
self-attention often limits the field of interactions of each token. To address
this issue, we develop the Cross-Shaped Window self-attention mechanism for
computing self-attention in the horizontal and vertical stripes in parallel
that form a cross-shaped window, with each stripe obtained by splitting the
input feature into stripes of equal width. We provide a detailed mathematical
analysis of the effect of the stripe width and vary the stripe width for
different layers of the Transformer network which achieves strong modeling
capability while limiting the computation cost. We also introduce
Locally-enhanced Positional Encoding (LePE), which handles the local positional
information better than existing encoding schemes. LePE naturally supports
arbitrary input resolutions, and is thus especially effective and friendly for
downstream tasks. Incorporated with these designs and a hierarchical structure,
CSWin Transformer demonstrates competitive performance on common vision tasks.
Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra
training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection
task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the
previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and
+2.0 respectively under a similar FLOPs setting. By further pretraining on
the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K
and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The
code and models will be available at
https://github.com/microsoft/CSWin-Transformer.
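To make the cross-shaped window mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: half of the attention heads attend within horizontal stripes of width sw, the other half within vertical stripes, and the two outputs are concatenated. This is an illustrative reconstruction from the abstract, not the official implementation in the repository above; the helper stripe_attention, the CSWinAttention module, and all tensor layouts are assumptions.

```python
import torch
import torch.nn as nn


def stripe_attention(q, k, v, H, W, sw, horizontal=True):
    """Self-attention restricted to horizontal (sw x W) or vertical (H x sw) stripes.

    q, k, v: (B, heads, H*W, d) tensors for the heads assigned to this branch.
    sw: stripe width; H and W are assumed divisible by sw for simplicity.
    """
    B, h, _, d = q.shape

    def to_stripes(x):
        x = x.reshape(B, h, H, W, d)
        if horizontal:
            # group rows into H/sw stripes of sw*W tokens each
            return x.reshape(B, h, H // sw, sw, W, d).reshape(B * h * (H // sw), sw * W, d)
        # group columns into W/sw stripes of H*sw tokens each
        x = x.reshape(B, h, H, W // sw, sw, d).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B * h * (W // sw), H * sw, d)

    qs, ks, vs = to_stripes(q), to_stripes(k), to_stripes(v)
    attn = (qs @ ks.transpose(-2, -1)) * d ** -0.5   # attention only within each stripe
    out = attn.softmax(dim=-1) @ vs

    # fold the stripes back into a (B, heads, H*W, d) token sequence
    if horizontal:
        return out.reshape(B, h, H // sw, sw, W, d).reshape(B, h, H * W, d)
    out = out.reshape(B, h, W // sw, H, sw, d).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, h, H * W, d)


class CSWinAttention(nn.Module):
    """Half of the heads attend in horizontal stripes, the other half in vertical ones."""

    def __init__(self, dim, num_heads, stripe_width):
        super().__init__()
        assert num_heads % 2 == 0
        self.num_heads = num_heads
        self.sw = stripe_width
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                      # x: (B, H*W, dim)
        B, N, C = x.shape
        d = C // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, d).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]             # each (B, heads, N, d)
        half = self.num_heads // 2
        out_h = stripe_attention(q[:, :half], k[:, :half], v[:, :half], H, W, self.sw, True)
        out_v = stripe_attention(q[:, half:], k[:, half:], v[:, half:], H, W, self.sw, False)
        out = torch.cat([out_h, out_v], dim=1)       # concatenate the two head groups
        return self.proj(out.transpose(1, 2).reshape(B, N, C))


# hypothetical usage: 4 heads over a 56x56 token map with stripe width 7
# attn = CSWinAttention(dim=64, num_heads=4, stripe_width=7)
# y = attn(torch.randn(2, 56 * 56, 64), H=56, W=56)   # -> (2, 3136, 64)
```

In this sketch each token attends to sw*W tokens in the horizontal branch and H*sw in the vertical one, so the attention cost grows roughly like H*W*sw*(H+W) instead of (H*W)^2 for global attention, which is why varying the stripe width across stages lets the design trade receptive field against FLOPs.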
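The abstract also says that LePE handles local positional information and supports arbitrary input resolutions. A minimal sketch consistent with that description (the module name, the 3x3 kernel, and where it is added are assumptions, not the paper's exact configuration) is a depthwise convolution applied to the value tokens in their 2-D layout and added to the attention output:

```python
import torch
import torch.nn as nn


class LePE(nn.Module):
    """Locally-enhanced positional encoding sketch: a depthwise conv over V."""

    def __init__(self, dim):
        super().__init__()
        # depthwise 3x3 conv works for any H, W, so no fixed-size positional table is needed
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, v, H, W):                      # v: (B, H*W, dim) value tokens
        B, N, C = v.shape
        pe = self.dw(v.transpose(1, 2).reshape(B, C, H, W))
        return pe.reshape(B, C, N).transpose(1, 2)   # added to the attention output


# hypothetical usage inside an attention block:
# out = attention_output + LePE(dim)(v_tokens, H, W)
```

Because the positional term is a convolution rather than a table indexed by absolute or relative positions, it transfers directly to the larger resolutions used in detection and segmentation, which matches the abstract's claim that LePE is especially friendly to downstream tasks.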
Related papers
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers that pursue a better trade-off between computational complexity and the size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experimental results show that DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilot models, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to the global self-attention, PS-Attention can reduce the computation and memory costs significantly.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M parameters, respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)