Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- URL: http://arxiv.org/abs/2103.14030v1
- Date: Thu, 25 Mar 2021 17:59:31 GMT
- Title: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Authors: Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng
Zhang and Stephen Lin and Baining Guo
- Abstract summary: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
- Score: 44.086393272557416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new vision Transformer, called Swin Transformer, that
capably serves as a general-purpose backbone for computer vision. Challenges in
adapting Transformer from language to vision arise from differences between the
two domains, such as large variations in the scale of visual entities and the
high resolution of pixels in images compared to words in text. To address these
differences, we propose a hierarchical Transformer whose representation is
computed with shifted windows. The shifted windowing scheme brings greater
efficiency by limiting self-attention computation to non-overlapping local
windows while also allowing for cross-window connection. This hierarchical
architecture has the flexibility to model at various scales and has linear
computational complexity with respect to image size. These qualities of Swin
Transformer make it compatible with a broad range of vision tasks, including
image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction
tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev)
and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses
the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP
on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of
Transformer-based models as vision backbones. The code and models will be made
publicly available at https://github.com/microsoft/Swin-Transformer.
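To make the shifted-window scheme described above concrete, the following is a minimal PyTorch sketch of self-attention restricted to non-overlapping local windows, with an optional cyclic shift. All names, shapes, and module choices (window_partition, WindowAttentionBlock, the use of torch.roll and nn.MultiheadAttention) are illustrative assumptions; the attention masking and relative position bias of the actual Swin implementation are omitted for brevity.

```python
# Sketch only: self-attention inside non-overlapping local windows,
# with an optional cyclic shift to create cross-window connections.
import torch
import torch.nn as nn

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (B * num_windows, window_size * window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads, window_size=7, shift_size=0):
        super().__init__()
        self.window_size, self.shift_size = window_size, shift_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window_size
        H, W = x.shape[1], x.shape[2]
        if self.shift_size > 0:
            # cyclic shift moves the window boundaries, linking neighbouring windows
            x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        windows = window_partition(x, self.window_size)
        attn_out, _ = self.attn(windows, windows, windows)  # attention only within each window
        x = window_reverse(attn_out, self.window_size, H, W)
        if self.shift_size > 0:
            x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        return x

# Usage: alternate a regular block and a shifted block, as the paper's design does.
x = torch.randn(2, 56, 56, 96)
regular = WindowAttentionBlock(96, num_heads=3, window_size=7, shift_size=0)
shifted = WindowAttentionBlock(96, num_heads=3, window_size=7, shift_size=3)
print(shifted(regular(x)).shape)  # torch.Size([2, 56, 56, 96])
```

Because attention is computed inside fixed-size windows, the cost grows linearly with the number of windows and hence with image size, which is the complexity property the abstract highlights.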
Related papers
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost (a hypothetical sketch of the idea follows below).
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
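The quadrangle-attention summary above hinges on a module that predicts a transformation per window. The sketch below is purely hypothetical: it uses a per-window affine transform with grid sampling as a simplified stand-in for the paper's quadrangle parameterization, and every name, shape, and the affine choice itself are assumptions rather than QFormer's implementation.

```python
# Hypothetical sketch: predict a transform from each window's features and
# warp the default sampling grid before gathering the window's tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowTransformPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 6)  # 6 affine parameters per window
        nn.init.zeros_(self.fc.weight)
        # start from the identity transform, i.e. the default axis-aligned window
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, win_feats):  # win_feats: (num_windows, tokens, dim)
        theta = self.fc(win_feats.mean(dim=1))  # pool tokens -> (num_windows, 6)
        return theta.reshape(-1, 2, 3)          # per-window affine matrices

def sample_transformed_window(feat, theta):
    # feat: (num_windows, C, w, w) features under each default window
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

wins = torch.randn(16, 49, 32)                   # 16 windows of 7x7 tokens, dim 32
theta = WindowTransformPredictor(32)(wins)
feat = wins.transpose(1, 2).reshape(16, 32, 7, 7)
print(sample_transformed_window(feat, theta).shape)  # torch.Size([16, 32, 7, 7])
```

An affine transform only covers parallelograms, whereas the paper's quadrangle formulation is more general; the predict-a-transform-per-window pattern is the part the summary describes, while the feature-resampling step here is an assumption.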
- 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation [5.635173603669784]
We propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation.
Specifically, we revisit volumetric depth-wise convolutions with a large kernel size (e.g. starting from 7×7×7) to enable larger global receptive fields, inspired by Swin Transformer (an illustrative sketch of such a block follows below).
arXiv Detail & Related papers (2022-09-29T19:54:13Z)
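The 3D UX-Net summary above mentions depth-wise volumetric convolutions with a 7×7×7 kernel. The block below is an illustrative sketch of such a large-kernel depth-wise convolution, not the authors' code; the normalization and point-wise projection choices are assumptions.

```python
# Illustrative sketch of a large-kernel depth-wise 3D convolution block.
import torch
import torch.nn as nn

class LargeKernelDWConv3d(nn.Module):
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        # groups=channels makes the convolution depth-wise: one 7x7x7 filter per channel
        self.dw = nn.Conv3d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.InstanceNorm3d(channels)                 # normalization choice is an assumption
        self.pw = nn.Conv3d(channels, channels, kernel_size=1)  # point-wise channel mixing

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.pw(self.norm(self.dw(x)))

x = torch.randn(1, 32, 16, 64, 64)
print(LargeKernelDWConv3d(32)(x).shape)  # torch.Size([1, 32, 16, 64, 64])
```

The large kernel gives each output voxel a wide receptive field in a single layer, which is the property the summary attributes to the design.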
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combines it with a pyramid architecture, and uses a split-transform-merge strategy to construct a group encoder; the resulting network architecture is named Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- What Makes for Hierarchical Vision Transformer? [46.848348453909495]
We replace self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mapping and keep other components unchanged.
The resulting architecture, with 25.4M parameters and 4.2G FLOPs, achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs (a sketch of such a replacement follows below).
arXiv Detail & Related papers (2021-07-05T17:59:35Z)
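The entry above reports that replacing window self-attention with a simple linear mapping costs surprisingly little accuracy. The sketch below shows one way such a replacement could look: a single learned matrix that mixes the tokens of each window. Names and shapes are illustrative assumptions, not the paper's code.

```python
# Sketch: replace attention inside a local window with a learned linear token mixer.
import torch
import torch.nn as nn

class LinearTokenMixer(nn.Module):
    """Mixes the N tokens of a window with one learned N x N matrix."""
    def __init__(self, num_tokens):
        super().__init__()
        self.mix = nn.Linear(num_tokens, num_tokens, bias=False)

    def forward(self, x):  # x: (num_windows, num_tokens, dim)
        # apply the linear map across the token dimension instead of self-attention
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

windows = torch.randn(64, 49, 96)  # e.g. 7x7 windows with 96 channels
print(LinearTokenMixer(49)(windows).shape)  # torch.Size([64, 49, 96])
```

Unlike attention, the mixing weights here are fixed after training rather than input-dependent, which is the kind of simplification the ablation studies.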
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z)