Degenerate Swin to Win: Plain Window-based Transformer without
Sophisticated Operations
- URL: http://arxiv.org/abs/2211.14255v1
- Date: Fri, 25 Nov 2022 17:36:20 GMT
- Title: Degenerate Swin to Win: Plain Window-based Transformer without
Sophisticated Operations
- Authors: Tan Yu, Ping Li
- Abstract summary: A Vision Transformer has a larger receptive field, which is capable of characterizing long-range dependencies.
To boost efficiency, window-based Vision Transformers have emerged.
We examine the necessity of the Swin Transformer's key design element, the shifted window partitioning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The formidable accomplishment of Transformers in natural language
processing has motivated researchers in the computer vision community to build
Vision Transformers. Compared with Convolutional Neural Networks (CNNs), a
Vision Transformer has a larger receptive field, which is capable of
characterizing long-range dependencies. Nevertheless, the large receptive field
of a Vision Transformer comes with a huge computational cost. To boost
efficiency, window-based Vision Transformers have emerged. They crop an image
into several local windows, and self-attention is conducted within each window.
To bring back the global receptive field, window-based Vision Transformers have
devoted considerable effort to achieving cross-window communication through
several sophisticated operations. In this work, we examine the necessity of the
Swin Transformer's key design element, the shifted window partitioning. We
discover that a simple depthwise convolution is sufficient for achieving
effective cross-window communication. Specifically, once the depthwise
convolution is in place, the shifted window configuration of the Swin
Transformer brings no additional performance improvement. Thus, we degenerate
the Swin Transformer to a plain Window-based (Win) Transformer by discarding
the sophisticated shifted window partitioning. The proposed Win Transformer is
conceptually simpler and easier to implement than the Swin Transformer.
Meanwhile, our Win Transformer consistently outperforms the Swin Transformer on
multiple computer vision tasks, including image recognition, semantic
segmentation, and object detection.
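For concreteness, below is a minimal sketch of the idea described in the abstract: plain (non-shifted) window self-attention combined with a depthwise convolution for cross-window communication. The block layout, layer placement, and names (`WinBlock`, `win_size`, the residual structure) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a plain Window-based (Win) Transformer block, based only on
# the abstract above: non-shifted window self-attention plus a depthwise
# convolution for cross-window communication. Layer placement, dimensions, and
# names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # -> (num_windows * B, ws * ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class WinBlock(nn.Module):
    """Window attention without shifting; a depthwise convolution mixes
    information across windows (the role shifted windows play in Swin)."""

    def __init__(self, dim, num_heads=4, win_size=7, mlp_ratio=4):
        super().__init__()
        self.ws = win_size
        # Depthwise 3x3 conv: groups == channels, cheap cross-window mixing.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Cross-window communication via depthwise conv (residual).
        x = x + self.dwconv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # Plain (non-shifted) window self-attention (residual).
        windows = window_partition(self.norm1(x), self.ws)
        attn_out, _ = self.attn(windows, windows, windows)
        x = x + window_reverse(attn_out, self.ws, H, W)
        # Feed-forward network (residual).
        return x + self.mlp(self.norm2(x))


# Example: a 56x56 feature map with 96 channels, 7x7 windows.
if __name__ == "__main__":
    block = WinBlock(dim=96, num_heads=4, win_size=7)
    out = block(torch.randn(2, 56, 56, 96))
    print(out.shape)  # torch.Size([2, 56, 56, 96])
```

Because every window is processed identically and nothing is shifted between layers, the block avoids the padding and attention-masking bookkeeping that the shifted window configuration requires.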
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide network design, as researchers can now consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - Swin-Free: Achieving Better Cross-Window Attention and Efficiency with
Size-varying Window [6.158271948005819]
We propose Swin-Free in which we apply size-varying windows across stages, instead of shifting windows, to achieve cross-connection among local windows.
With this simple design change, Swin-Free runs faster than the Swin Transformer at inference with better accuracy.
arXiv Detail & Related papers (2023-06-23T20:19:58Z) - SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
The Swin Transformer set new records on various vision tasks by using a hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z) - TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z) - Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [20.92010433074935]
We propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code (a minimal sketch of such a spatial shuffle follows this list).
The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2021-06-07T14:22:07Z) - Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
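The spatial shuffle mentioned in the Shuffle Transformer entry above can be illustrated with a small sketch. This is one plausible reading of the idea (a reshape-and-transpose regrouping of tokens, analogous to ShuffleNet's channel shuffle) under assumed shapes and function names; it is not the authors' implementation.

```python
# Hedged sketch of a spatial shuffle for window attention: after shuffling, a
# standard ws x ws window collects tokens sampled with stride H//ws and W//ws,
# so distant windows exchange information. Names and shapes are assumptions.
import torch


def spatial_shuffle(x, ws):
    """Regroup a (B, H, W, C) map so each ws x ws window holds strided samples."""
    B, H, W, C = x.shape
    x = x.view(B, ws, H // ws, ws, W // ws, C)
    # Swap the window-index and within-window axes on both spatial dimensions.
    return x.permute(0, 2, 1, 4, 3, 5).reshape(B, H, W, C)


def spatial_unshuffle(x, ws):
    """Inverse of spatial_shuffle."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 2, 1, 4, 3, 5).reshape(B, H, W, C)


# Round-trip check on a toy feature map.
if __name__ == "__main__":
    x = torch.randn(1, 8, 8, 3)
    assert torch.equal(spatial_unshuffle(spatial_shuffle(x, ws=4), ws=4), x)
```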