Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- URL: http://arxiv.org/abs/2103.14030v1
- Date: Thu, 25 Mar 2021 17:59:31 GMT
- Title: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Authors: Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng
Zhang and Stephen Lin and Baining Guo
- Abstract summary: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
- Score: 44.086393272557416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new vision Transformer, called Swin Transformer, that
capably serves as a general-purpose backbone for computer vision. Challenges in
adapting Transformer from language to vision arise from differences between the
two domains, such as large variations in the scale of visual entities and the
high resolution of pixels in images compared to words in text. To address these
differences, we propose a hierarchical Transformer whose representation is
computed with shifted windows. The shifted windowing scheme brings greater
efficiency by limiting self-attention computation to non-overlapping local
windows while also allowing for cross-window connection. This hierarchical
architecture has the flexibility to model at various scales and has linear
computational complexity with respect to image size. These qualities of Swin
Transformer make it compatible with a broad range of vision tasks, including
image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction
tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev)
and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses
the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP
on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of
Transformer-based models as vision backbones. The code and models will be made
publicly available at https://github.com/microsoft/Swin-Transformer.
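To make the shifted-window scheme described above concrete, the following is a minimal PyTorch sketch of self-attention restricted to non-overlapping local windows, with an optional cyclic shift. All names, shapes, and module choices (window_partition, WindowAttentionBlock, the use of torch.roll and nn.MultiheadAttention) are illustrative assumptions; the attention masking and relative position bias of the actual Swin implementation are omitted for brevity.

```python
# Sketch only: self-attention inside non-overlapping local windows,
# with an optional cyclic shift to create cross-window connections.
import torch
import torch.nn as nn

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (B * num_windows, window_size * window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads, window_size=7, shift_size=0):
        super().__init__()
        self.window_size, self.shift_size = window_size, shift_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window_size
        H, W = x.shape[1], x.shape[2]
        if self.shift_size > 0:
            # cyclic shift moves the window boundaries, linking neighbouring windows
            x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        windows = window_partition(x, self.window_size)
        attn_out, _ = self.attn(windows, windows, windows)  # attention only within each window
        x = window_reverse(attn_out, self.window_size, H, W)
        if self.shift_size > 0:
            x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        return x

# Usage: alternate a regular block and a shifted block, as the paper's design does.
x = torch.randn(2, 56, 56, 96)
regular = WindowAttentionBlock(96, num_heads=3, window_size=7, shift_size=0)
shifted = WindowAttentionBlock(96, num_heads=3, window_size=7, shift_size=3)
print(shifted(regular(x)).shape)  # torch.Size([2, 56, 56, 96])
```

Because attention is computed inside fixed-size windows, the cost grows linearly with the number of windows and hence with image size, which is the complexity property the abstract highlights.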
Related papers
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost (a hypothetical sketch of the idea follows below).
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
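The quadrangle-attention summary above hinges on a module that predicts a transformation per window. The sketch below is purely hypothetical: it uses a per-window affine transform with grid sampling as a simplified stand-in for the paper's quadrangle parameterization, and every name, shape, and the affine choice itself are assumptions rather than QFormer's implementation.

```python
# Hypothetical sketch: predict a transform from each window's features and
# warp the default sampling grid before gathering the window's tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowTransformPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 6)  # 6 affine parameters per window
        nn.init.zeros_(self.fc.weight)
        # start from the identity transform, i.e. the default axis-aligned window
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, win_feats):  # win_feats: (num_windows, tokens, dim)
        theta = self.fc(win_feats.mean(dim=1))  # pool tokens -> (num_windows, 6)
        return theta.reshape(-1, 2, 3)          # per-window affine matrices

def sample_transformed_window(feat, theta):
    # feat: (num_windows, C, w, w) features under each default window
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

wins = torch.randn(16, 49, 32)                   # 16 windows of 7x7 tokens, dim 32
theta = WindowTransformPredictor(32)(wins)
feat = wins.transpose(1, 2).reshape(16, 32, 7, 7)
print(sample_transformed_window(feat, theta).shape)  # torch.Size([16, 32, 7, 7])
```

An affine transform only covers parallelograms, whereas the paper's quadrangle formulation is more general; the predict-a-transform-per-window pattern is the part the summary describes, while the feature-resampling step here is an assumption.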
- 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation [5.635173603669784]
We propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation.
Specifically, we revisit volumetric depth-wise convolutions with a large kernel size (e.g. starting from 7×7×7) to enable larger global receptive fields, inspired by Swin Transformer (an illustrative sketch of such a block follows below).
arXiv Detail & Related papers (2022-09-29T19:54:13Z)
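The 3D UX-Net summary above mentions depth-wise volumetric convolutions with a 7×7×7 kernel. The block below is an illustrative sketch of such a large-kernel depth-wise convolution, not the authors' code; the normalization and point-wise projection choices are assumptions.

```python
# Illustrative sketch of a large-kernel depth-wise 3D convolution block.
import torch
import torch.nn as nn

class LargeKernelDWConv3d(nn.Module):
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        # groups=channels makes the convolution depth-wise: one 7x7x7 filter per channel
        self.dw = nn.Conv3d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.InstanceNorm3d(channels)                 # normalization choice is an assumption
        self.pw = nn.Conv3d(channels, channels, kernel_size=1)  # point-wise channel mixing

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.pw(self.norm(self.dw(x)))

x = torch.randn(1, 32, 16, 64, 64)
print(LargeKernelDWConv3d(32)(x).shape)  # torch.Size([1, 32, 16, 64, 64])
```

The large kernel gives each output voxel a wide receptive field in a single layer, which is the property the summary attributes to the design.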
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combines it with a pyramid architecture, and uses a split-transform-merge strategy to construct a group encoder; the resulting network architecture is named Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- What Makes for Hierarchical Vision Transformer? [46.848348453909495]
We replace self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mapping and keep other components unchanged.
The resulting architecture, with 25.4M parameters and 4.2G FLOPs, achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs (a sketch of such a replacement follows below).
arXiv Detail & Related papers (2021-07-05T17:59:35Z)
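The entry above reports that replacing window self-attention with a simple linear mapping costs surprisingly little accuracy. The sketch below shows one way such a replacement could look: a single learned matrix that mixes the tokens of each window. Names and shapes are illustrative assumptions, not the paper's code.

```python
# Sketch: replace attention inside a local window with a learned linear token mixer.
import torch
import torch.nn as nn

class LinearTokenMixer(nn.Module):
    """Mixes the N tokens of a window with one learned N x N matrix."""
    def __init__(self, num_tokens):
        super().__init__()
        self.mix = nn.Linear(num_tokens, num_tokens, bias=False)

    def forward(self, x):  # x: (num_windows, num_tokens, dim)
        # apply the linear map across the token dimension instead of self-attention
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

windows = torch.randn(64, 49, 96)  # e.g. 7x7 windows with 96 channels
print(LinearTokenMixer(49)(windows).shape)  # torch.Size([64, 49, 96])
```

Unlike attention, the mixing weights here are fixed after training rather than input-dependent, which is the kind of simplification the ablation studies.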
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z)