Beyond Fixation: Dynamic Window Visual Transformer
- URL: http://arxiv.org/abs/2203.12856v1
- Date: Thu, 24 Mar 2022 05:38:07 GMT
- Title: Beyond Fixation: Dynamic Window Visual Transformer
- Authors: Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, Xiaojun Chang
- Abstract summary: We propose a novel method named Dynamic Window Vision Transformer (DW-ViT).
In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention.
DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer.
- Score: 19.982593341611796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been a surge of interest in reducing the
computational cost of visual transformers by limiting the calculation of
self-attention to a local window. Most current work uses a fixed single-scale window for modeling by
default, ignoring the impact of window size on model performance. However, this
may limit the modeling potential of these window-based models for multi-scale
information. In this paper, we propose a novel method, named Dynamic Window
Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT
goes beyond models that employ a fixed single-window setting. To the best
of our knowledge, we are the first to use dynamic multi-scale windows to
explore the upper limit of the effect of window settings on model performance.
In DW-ViT, multi-scale information is obtained by assigning windows of
different sizes to different head groups of window multi-head self-attention.
Then, the information is dynamically fused by assigning different weights to
the multi-scale window branches. We conducted a detailed performance evaluation
on three datasets, ImageNet-1K, ADE20K, and COCO. Compared with related
state-of-the-art (SoTA) methods, DW-ViT obtains the best performance.
Specifically, compared with the current SoTA Swin Transformer (Liu et al.,
2021), DW-ViT achieves consistent and substantial improvements
on all three datasets with similar parameters and computational costs. In
addition, DW-ViT exhibits good scalability and can be easily inserted into any
window-based visual transformer.
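The mechanism described in the abstract can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration and not the authors' implementation: it uses a separate attention module per window size rather than splitting the heads of a single window multi-head self-attention layer, omits relative position bias and shifted windows, and the window sizes, head counts, and fusion MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DynamicWindowAttention(nn.Module):
    """Toy version of the idea above: one attention branch per window size,
    with branch outputs fused by input-dependent softmax weights."""

    def __init__(self, dim=96, window_sizes=(7, 14), heads_per_branch=2):
        super().__init__()
        self.window_sizes = window_sizes
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads_per_branch, batch_first=True)
             for _ in window_sizes]
        )
        # Hypothetical fusion head: globally pool features, then predict one
        # weight per window-size branch.
        self.fusion = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, len(window_sizes))
        )

    @staticmethod
    def _partition(x, ws):
        # (B, H, W, C) -> (B * num_windows, ws * ws, C)
        B, H, W, C = x.shape
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

    @staticmethod
    def _reverse(windows, ws, B, H, W, C):
        # Inverse of _partition: (B * num_windows, ws * ws, C) -> (B, H, W, C)
        x = windows.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by every window size.
        B, H, W, C = x.shape
        outs = []
        for ws, attn in zip(self.window_sizes, self.branches):
            win = self._partition(x, ws)
            y, _ = attn(win, win, win, need_weights=False)  # attention inside each window
            outs.append(self._reverse(y, ws, B, H, W, C))
        # Dynamic fusion: data-dependent weight for each window-size branch.
        gate = self.fusion(x.mean(dim=(1, 2))).softmax(dim=-1)  # (B, num_branches)
        return sum(gate[:, i].view(B, 1, 1, 1) * o for i, o in enumerate(outs))


# Example: a 28x28 feature map with 96 channels; both 7x7 and 14x14 windows tile it.
feat = torch.randn(2, 28, 28, 96)
print(DynamicWindowAttention()(feat).shape)  # torch.Size([2, 28, 28, 96])
```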
Related papers
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution [70.52256118833583]
We present a strategy to convert transformer-based SR networks to hierarchical transformers (HiT-SR)
Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales.
Considering the intensive computation required for large windows, we further design a spatial-channel correlation method with linear complexity with respect to window size.
arXiv Detail & Related papers (2024-07-08T12:42:10Z) - Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z) - VSA: Learning Varied-Size Window Attention in Vision Transformers [76.35955924137986]
We propose Varied-Size Window Attention (VSA) to learn adaptive window configurations from data.
Based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window; a toy sketch of this regression step follows this list.
arXiv Detail & Related papers (2022-04-18T17:56:07Z) - MixFormer: Mixing Features across Windows and Dimensions [68.86393312123168]
Local-window self-attention performs notably well in vision tasks, but suffers from a limited receptive field and weak modeling capability.
This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension.
We combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields.
arXiv Detail & Related papers (2022-04-06T03:13:50Z) - Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into semantic segmentation ViT via window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z) - SimViT: Exploring a Simple Vision Transformer with sliding windows [3.3107339588116123]
We introduce a vision Transformer named SimViT, to incorporate spatial structure and local information into the vision Transformers.
SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks.
Our SimViT-Micro only needs 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1K dataset.
arXiv Detail & Related papers (2021-12-24T15:18:20Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - What Makes for Hierarchical Vision Transformer? [46.848348453909495]
We replace self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mapping and keep other components unchanged.
The resulting architecture with 25.4M parameters and 4.2G FLOPs achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs.
arXiv Detail & Related papers (2021-07-05T17:59:35Z)
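For the varied-size window idea in the VSA entry above, the regression step can likewise be pictured with a toy sketch. This is not the VSA implementation: the parameterization of scale and offset, the pooling, and the value ranges are assumptions made only to illustrate predicting a window's size and location from the tokens of its default window.

```python
import torch
import torch.nn as nn


class WindowRegression(nn.Module):
    """Toy regression head: from the tokens of a default window, predict a
    per-window scale and offset that would define a resized, shifted window."""

    def __init__(self, dim=96, window=7):
        super().__init__()
        self.window = window
        self.head = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 4)
        )

    def forward(self, win_tokens):
        # win_tokens: (num_windows, window * window, dim)
        params = self.head(win_tokens.mean(dim=1))          # (num_windows, 4)
        scale = 1.0 + params[:, :2].tanh()                   # width/height factor in (0, 2)
        offset = params[:, 2:].tanh() * self.window / 2.0    # center shift in pixels
        return scale, offset


# Example: 16 default 7x7 windows with 96-dim tokens. The predicted (scale, offset)
# pairs would then drive re-sampling of keys/values from the new windows
# (the sampling step itself is omitted here).
tokens = torch.randn(16, 49, 96)
scale, offset = WindowRegression()(tokens)
print(scale.shape, offset.shape)  # torch.Size([16, 2]) torch.Size([16, 2])
```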
This list is automatically generated from the titles and abstracts of the papers on this site.