Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- URL: http://arxiv.org/abs/2106.03650v1
- Date: Mon, 7 Jun 2021 14:22:07 GMT
- Title: Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
- Authors: Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu
- Abstract summary: We propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code.
The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation.
- Score: 20.92010433074935
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Very recently, Window-based Transformers, which compute self-attention within non-overlapping local windows, have demonstrated promising results on image classification, semantic segmentation, and object detection. However, less attention has been devoted to the cross-window connection, which is the key element for improving representation ability. In this work, we revisit the spatial shuffle as an efficient way to build connections among windows. As a result, we propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code. Furthermore, a depth-wise convolution is introduced to complement the spatial shuffle for enhancing neighbor-window connections. The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation. Code will be released for reproduction.
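The shuffle itself is just a regrouping of pixels before the ordinary window partition. Below is a minimal PyTorch sketch of the idea (our own illustration, not the released code), assuming (B, H, W, C) feature maps with sides divisible by the window size; only the two marked lines differ from the standard partition, matching the abstract's "two lines of code" claim.

```python
import torch

def window_partition(x, ws):
    # Standard non-overlapping partition: (B, H, W, C) -> (B*nW, ws, ws, C).
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

def shuffle_window_partition(x, ws):
    # Spatial shuffle: sample pixels with stride H//ws (resp. W//ws), so each
    # window mixes tokens drawn from across the whole feature map.
    B, H, W, C = x.shape
    x = x.view(B, ws, H // ws, ws, W // ws, C)                  # changed line 1
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, ws, ws, C)   # changed line 2

x = torch.randn(2, 8, 8, 32)
assert shuffle_window_partition(x, 4).shape == (8, 4, 4, 32)
```

A depth-wise 3x3 convolution (e.g. nn.Conv2d(C, C, 3, padding=1, groups=C)) can then supply the neighbor-window connections that the shuffle alone pulls apart, as the abstract describes.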
Related papers
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
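As a hedged sketch of that mechanism (our names and simplifications, not the QFormer code), one can regress a per-window affine matrix, which subsumes the scaling/shear/rotation/translation the paper composes, and apply it with grid sampling; for brevity this warps within each default window, whereas the paper can sample beyond it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleSampler(nn.Module):
    # Regress one 2x3 affine matrix per default window, then warp the
    # window's sampling grid with it (initialized to the identity).
    def __init__(self, dim, ws):
        super().__init__()
        self.ws = ws
        self.regress = nn.Linear(dim, 6)
        nn.init.zeros_(self.regress.weight)
        with torch.no_grad():
            self.regress.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws = self.ws
        nh, nw = H // ws, W // ws
        desc = F.adaptive_avg_pool2d(x, (nh, nw))          # one descriptor per window
        theta = self.regress(desc.permute(0, 2, 3, 1))     # (B, nh, nw, 6)
        theta = theta.reshape(B * nh * nw, 2, 3)
        grid = F.affine_grid(theta, (B * nh * nw, C, ws, ws), align_corners=False)
        wins = (x.view(B, C, nh, ws, nw, ws)
                 .permute(0, 2, 4, 1, 3, 5)
                 .reshape(B * nh * nw, C, ws, ws))
        return F.grid_sample(wins, grid, align_corners=False)
```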
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations [36.57766081271396]
A Vision Transformer has a larger receptive field, capable of characterizing long-range dependencies.
To boost efficiency, window-based Vision Transformers have emerged.
We check the necessity of the key design element of Swin Transformer, the shifted window partitioning.
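For reference, the design element under scrutiny is a cyclic roll applied before the ordinary partition; a minimal sketch (the attention masking Swin applies at wrapped-around borders is omitted):

```python
import torch

def shifted_window_partition(x, ws):
    # Roll by half a window so the next block's windows straddle the
    # previous block's window borders, creating cross-window connections.
    B, H, W, C = x.shape
    x = torch.roll(x, shifts=(-(ws // 2), -(ws // 2)), dims=(1, 2))
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)
```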
arXiv Detail & Related papers (2022-11-25T17:36:20Z)
- Cross Aggregation Transformer for Image Restoration [48.390140041131886]
Recently, the Transformer architecture has been introduced into image restoration to replace convolutional neural networks (CNNs), with surprising results.
We propose a new image restoration model, Cross Aggregation Transformer (CAT).
The core of our CAT is the Rectangle-Window Self-Attention (Rwin-SA), which applies horizontal and vertical rectangle window attention in different heads in parallel, expanding the attention area and aggregating features across different windows.
Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism by incorporating the inductive bias of CNNs (e.g., translation invariance).
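A hedged sketch of the Rwin-SA token layout (the rectangle sizes, the half-and-half channel split, and the omitted per-rectangle attention are our illustrative assumptions; sides must divide by both rectangle dimensions):

```python
import torch

def rect_partition(x, rh, rw):
    # Split (B, H, W, C) into non-overlapping rh x rw rectangles of tokens.
    B, H, W, C = x.shape
    x = x.view(B, H // rh, rh, W // rw, rw, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, rh * rw, C)

def rwin_token_sets(x, rh=4, rw=16):
    # Half the channels see horizontal rectangles, half vertical ones;
    # standard self-attention (not shown) then runs per rectangle and the
    # two halves are concatenated back, aggregating across window shapes.
    xh, xv = x.chunk(2, dim=-1)
    return rect_partition(xh, rh, rw), rect_partition(xv, rw, rh)
```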
arXiv Detail & Related papers (2022-11-24T15:09:33Z)
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
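The modulation can be sketched as follows (our reading of the idea, not the released code; the kernel size and layer order are assumptions): a large-kernel depth-wise convolution produces gating weights that multiply a linear value branch, standing in for the softmax(QK^T)V of self-attention.

```python
import torch.nn as nn

class ConvModulation(nn.Module):
    # Attention-like gating from a large-kernel depth-wise convolution,
    # applied to a 1x1-conv "value" branch, followed by a projection.
    def __init__(self, dim, kernel_size=11):
        super().__init__()
        self.a = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
        )
        self.v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):              # x: (B, C, H, W)
        return self.proj(self.a(x) * self.v(x))
```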
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set new records on various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
The resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and a LawinASPP as the decoder.
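A hedged sketch of large window attention as we read it (the unfold bookkeeping and the pooling choice are our assumptions, and sides must divide by the window size): each ws x ws query window attends to a context patch r times larger, average-pooled back to ws x ws, so the cost stays near plain window attention while the receptive field grows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeWindowAttention(nn.Module):
    # Query tokens come from ws x ws windows; keys/values come from an
    # r-times-larger context patch pooled back down to ws x ws.
    def __init__(self, dim, ws=8, r=2, heads=4):
        super().__init__()
        self.ws, self.r = ws, r
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws, r = self.ws, self.r
        q = F.unfold(x, ws, stride=ws)                      # (B, C*ws*ws, L)
        ctx = F.unfold(x, ws * r, stride=ws, padding=ws * (r - 1) // 2)
        L = q.shape[-1]
        q = q.transpose(1, 2).reshape(B * L, C, ws * ws).transpose(1, 2)
        ctx = ctx.transpose(1, 2).reshape(B * L, C, ws * r, ws * r)
        ctx = F.adaptive_avg_pool2d(ctx, ws).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, ctx, ctx)                     # (B*L, ws*ws, C)
        return out.reshape(B, L, ws * ws, C)
```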
arXiv Detail & Related papers (2022-01-05T13:51:20Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
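In spirit, each path attends at a different granularity and the results are fused; a sketch under our own assumptions (the real model restricts the fine path to local windows, omitted here for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPathBlock(nn.Module):
    # Each path pools the input by a different factor, attends at that
    # granularity, and the upsampled results are fused: coarse paths see
    # near-global context while the fine path stays high-resolution.
    def __init__(self, dim, heads=4, factors=(1, 2, 4)):
        super().__init__()
        self.factors = factors
        self.paths = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in factors)
        self.fuse = nn.Conv2d(dim * len(factors), dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        outs = []
        for f, attn in zip(self.factors, self.paths):
            t = F.avg_pool2d(x, f) if f > 1 else x
            h, w = t.shape[-2:]
            t = t.flatten(2).transpose(1, 2)    # (B, h*w, C) tokens
            t, _ = attn(t, t, t)
            t = t.transpose(1, 2).reshape(B, C, h, w)
            outs.append(F.interpolate(t, size=(H, W)) if f > 1 else t)
        return self.fuse(torch.cat(outs, dim=1))
```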
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into a Transformer-based U-shaped Encoder-Decoder architecture with skip connections.
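An illustrative skeleton of that U-shape (ours, not the released Swin-Unet: plain Transformer layers stand in for Swin blocks, and a single skip connection stands in for several):

```python
import torch
import torch.nn as nn

class TinyTransformerUNet(nn.Module):
    # Patch embedding -> encoder -> decoder fed with a skip connection ->
    # patch expansion back to a dense prediction map.
    def __init__(self, in_ch=3, num_classes=9, dim=96, patch=4, heads=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(in_ch, dim, patch, stride=patch)
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(make(), num_layers=2)
        self.dec = nn.TransformerEncoder(make(), num_layers=2)
        self.skip = nn.Linear(2 * dim, dim)     # fuse decoder input with encoder tokens
        self.expand = nn.ConvTranspose2d(dim, num_classes, patch, stride=patch)

    def forward(self, x):                       # x: (B, in_ch, H, W)
        B, _, H, W = x.shape
        t = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        e = self.enc(t)
        d = self.dec(self.skip(torch.cat([e, t], dim=-1)))
        h, w = H // self.patch, W // self.patch
        return self.expand(d.transpose(1, 2).reshape(B, -1, h, w))
```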
arXiv Detail & Related papers (2021-05-12T09:30:26Z)