What Makes for Hierarchical Vision Transformer?
- URL: http://arxiv.org/abs/2107.02174v1
- Date: Mon, 5 Jul 2021 17:59:35 GMT
- Title: What Makes for Hierarchical Vision Transformer?
- Authors: Yuxin Fang, Xinggang Wang, Rui Wu, Jianwei Niu, Wenyu Liu
- Abstract summary: We replace self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mapping and keep other components unchanged.
The resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs.
- Score: 46.848348453909495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that a hierarchical Vision Transformer with interleaved non-overlapped intra-window self-attention & shifted-window self-attention is able to achieve state-of-the-art performance on various visual recognition tasks, and challenges CNN's dense sliding-window paradigm. Most follow-up works try to replace the shifted-window operation with other kinds of cross-window communication, while treating self-attention as the de facto standard for intra-window information aggregation. In this short preprint, we question whether self-attention is the only choice for a hierarchical Vision Transformer to attain strong performance, and what makes for a hierarchical Vision Transformer? We replace the self-attention layers in Swin Transformer and Shuffle Transformer with a simple linear mapping and keep other components unchanged. The resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs. We also experiment with other alternatives to self-attention for context aggregation inside each non-overlapped window, all of which give similarly competitive results under the same architecture. Our study reveals that the **macro architecture** of the Swin model family (i.e., interleaved intra-window & cross-window communication), rather than specific aggregation layers or specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to CNN's dense sliding-window paradigm.
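To make the described experiment concrete, below is a minimal PyTorch sketch (not the authors' released code) of the idea in the abstract: a Swin-style block that keeps the macro architecture, i.e. window partitioning plus the shifted-window cyclic shift for cross-window communication, but replaces intra-window self-attention with a simple learnable linear mapping over the tokens of each window. All module and argument names (`LinearMixing`, `WindowBlock`, `shift`, etc.) are hypothetical and chosen for illustration; sizes follow the usual Swin defaults (window size 7, shift 3).

```python
import torch
import torch.nn as nn


def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws*ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """(B * num_windows, ws*ws, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class LinearMixing(nn.Module):
    """Drop-in replacement for window self-attention: a learnable linear map
    over the ws*ws token positions of each window (no query/key/value)."""

    def __init__(self, window_size):
        super().__init__()
        n = window_size * window_size
        self.mix = nn.Linear(n, n)  # mixes tokens; shared across channels and windows

    def forward(self, win):          # win: (B*nW, ws*ws, C)
        return self.mix(win.transpose(1, 2)).transpose(1, 2)


class WindowBlock(nn.Module):
    """One block of the unchanged Swin macro architecture: (shifted) window
    partition -> intra-window aggregation -> window reverse -> MLP."""

    def __init__(self, dim, window_size=7, shift=0, mlp_ratio=4):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = LinearMixing(window_size)   # <- in place of W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):            # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:               # cyclic shift = cross-window communication
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)
        win = self.mixer(win)        # intra-window aggregation (linear mapping)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x
        return x + self.mlp(self.norm2(x))


# Interleaved regular / shifted blocks, mirroring Swin's macro design:
blocks = nn.Sequential(WindowBlock(96, shift=0), WindowBlock(96, shift=3))
out = blocks(torch.randn(2, 56, 56, 96))   # -> (2, 56, 56, 96)
```

Swapping `LinearMixing` for any other per-window aggregator (e.g., a small MLP or pooling) while leaving the rest of the block untouched corresponds to the other alternatives to self-attention mentioned in the abstract.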
Related papers
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Cross Aggregation Transformer for Image Restoration [48.390140041131886]
Recently, the Transformer architecture has been introduced into image restoration to replace the convolutional neural network (CNN), with surprising results.
To address the above issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT).
The core of our CAT is the Rectangle-Window Self-Attention (Rwin-SA), which uses horizontal and vertical rectangle-window attention in different heads in parallel to expand the attention area and aggregate features across different windows.
Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism, which incorporates the inductive bias of CNN (e.g., translation invariance).
arXiv Detail & Related papers (2022-11-24T15:09:33Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- SepViT: Separable Vision Transformer [20.403430632658946]
Vision Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and follow its design principle to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT.
SepViT carries out local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention.
arXiv Detail & Related papers (2022-03-29T09:20:01Z)
- Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.