VSA: Learning Varied-Size Window Attention in Vision Transformers
- URL: http://arxiv.org/abs/2204.08446v2
- Date: Mon, 3 Jul 2023 07:49:59 GMT
- Title: VSA: Learning Varied-Size Window Attention in Vision Transformers
- Authors: Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao
- Abstract summary: We propose Varied-Size Window Attention (VSA) to learn adaptive window configurations from data.
Based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window.
- Score: 76.35955924137986
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Attention within windows has been widely explored in vision transformers to
balance the performance, computation complexity, and memory footprint. However,
current models adopt a hand-crafted, fixed-size window design, which restricts
their capacity to model long-term dependencies and adapt to objects of
different sizes. To address this drawback, we propose
Varied-Size Window Attention (VSA) to learn adaptive
window configurations from data. Specifically, based on the tokens within each
default window, VSA employs a window regression module to predict the size and
location of the target window, i.e., the attention area where the key and value
tokens are sampled. By adopting VSA independently for each attention head, it
can model long-term dependencies, capture rich context from diverse windows,
and promote information exchange among overlapped windows. VSA is an
easy-to-implement module that can replace the window attention in
state-of-the-art representative models with minor modifications and negligible
extra computational cost while improving their performance by a large margin,
e.g., 1.1% for Swin-T on ImageNet classification. In addition, the performance
gain increases when larger images are used for training and testing. Experimental
results on more downstream tasks, including object detection, instance
segmentation, and semantic segmentation, further demonstrate the superiority of
VSA over the vanilla window attention in dealing with objects of different
sizes. The code will be released at
https://github.com/ViTAE-Transformer/ViTAE-VSA.
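To make the description above concrete, here is a minimal PyTorch sketch of the idea: the tokens of each default window are pooled, a small regression head predicts a per-head scale and offset, and the key/value features are re-sampled from the resulting varied-size windows while the queries keep the default partition. The class name, the regression head, and the sampling details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariedSizeWindowSampler(nn.Module):
    """Re-samples key/value features from learned, per-head window regions.

    Illustrative only: the names and the exact regression/sampling details
    are assumptions, not the released VSA code.
    """

    def __init__(self, dim, num_heads, window_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.ws = num_heads, window_size
        # Summarise the tokens of each default window, then predict
        # (scale_x, scale_y, offset_x, offset_y) for every attention head.
        self.regress = nn.Sequential(
            nn.AvgPool2d(window_size, stride=window_size),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(dim, num_heads * 4, kernel_size=1),
        )

    def forward(self, x):
        # x: (B, C, H, W) feature map, H and W divisible by the window size.
        B, C, H, W = x.shape
        ws, nh, nw = self.ws, H // self.ws, W // self.ws
        p = self.regress(x).view(B, self.heads, 4, nh, nw)
        scale = 1.0 + p[:, :, :2]   # scale is parameterised around the default size
        offset = p[:, :, 2:]        # shift of the window centre (normalised coords)

        # Default window centres and intra-window offsets in grid_sample's
        # [-1, 1] coordinate system (align_corners=False convention).
        cy = ((torch.arange(nh, device=x.device) + 0.5) * ws / H) * 2 - 1
        cx = ((torch.arange(nw, device=x.device) + 0.5) * ws / W) * 2 - 1
        d = (torch.arange(ws, device=x.device) + 0.5) / ws - 0.5

        # Sample location = window centre + learned offset + scaled intra-window offset.
        sy = (cy[None, None, :, None, None, None]
              + offset[:, :, 1][:, :, :, None, :, None]
              + d[None, None, None, :, None, None]
              * scale[:, :, 1][:, :, :, None, :, None] * (2 * ws / H))
        sx = (cx[None, None, None, None, :, None]
              + offset[:, :, 0][:, :, :, None, :, None]
              + d[None, None, None, None, None, :]
              * scale[:, :, 0][:, :, :, None, :, None] * (2 * ws / W))
        sx, sy = torch.broadcast_tensors(sx, sy)
        grid = torch.stack([sx, sy], dim=-1).reshape(B * self.heads, H, W, 2)

        # Each head samples its own channel group with its own windows.
        xh = x.reshape(B, self.heads, C // self.heads, H, W).reshape(B * self.heads, -1, H, W)
        kv = F.grid_sample(xh, grid, mode="bilinear", align_corners=False)
        return kv.reshape(B, C, H, W)   # features to project into keys and values
```

Because the scale is parameterised as 1 + prediction and the regressed offsets start small, the sampler initially behaves like plain fixed-window sampling and can learn to grow, shrink, or shift each head's window during training.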
Related papers
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like
Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- RFR-WWANet: Weighted Window Attention-Based Recovery Feature Resolution
Network for Unsupervised Image Registration [7.446209993071451]
The Swin transformer has attracted attention in medical image analysis due to its computational efficiency and long-range modeling capability.
Transformer-based registration models combine multiple voxels into a single semantic token.
This merging process restricts the transformers to modeling and generating only coarse-grained spatial information.
We propose the Recovery Feature Resolution Network (RFRNet), which allows the transformer to contribute fine-grained spatial information.
arXiv Detail & Related papers (2023-05-07T09:57:29Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
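Since this entry shares the window-regression theme with VSA, a small hedged sketch of the core step may help: a predicted 3x3 projective matrix maps a default window's sample coordinates to the target quadrangle. Function and variable names are ours, not the QFormer code; see the paper for the actual regression and feature re-sampling.

```python
import torch

def transform_window_coords(coords, M):
    """coords: (N, 2) sample points (x, y) of one default window;
    M: (3, 3) predicted projective transform (assumed to keep w > 0).
    Returns the corresponding points of the target quadrangle."""
    homo = torch.cat([coords, torch.ones(coords.shape[0], 1)], dim=1)
    out = homo @ M.T                   # apply the transform in homogeneous coordinates
    return out[:, :2] / out[:, 2:3]    # back to Cartesian coordinates

# Sanity check: the identity transform leaves the default window unchanged.
ws = 7
ys, xs = torch.meshgrid(torch.arange(ws, dtype=torch.float32),
                        torch.arange(ws, dtype=torch.float32), indexing="ij")
pts = torch.stack([xs.flatten(), ys.flatten()], dim=1)
assert torch.allclose(transform_window_coords(pts, torch.eye(3)), pts)
```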
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Sparse Visual Counterfactual Explanations in Image Space [50.768119964318494]
We present a novel model for visual counterfactual explanations in image space.
We show that it can be used to detect undesired behavior of ImageNet classifiers due to spurious features in the ImageNet dataset.
arXiv Detail & Related papers (2022-05-16T20:23:11Z)
- Beyond Fixation: Dynamic Window Visual Transformer [19.982593341611796]
We propose a novel method named Dynamic Window Vision Transformer (DW-ViT).
In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention.
DW-ViT exhibits good scalability and can be easily inserted into any window-based vision transformer.
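A rough sketch of how such multi-scale window attention can be arranged (our reading of the summary, not the DW-ViT code): channel groups standing in for head groups are partitioned with different window sizes and attend within their own windows before being concatenated.

```python
import torch
import torch.nn.functional as F

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows * B, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def multi_size_window_attn(x, window_sizes=(4, 8)):
    """x: (B, H, W, C); each channel group attends within windows of its own
    size. H and W must be divisible by every window size; Q = K = V = raw
    features for brevity (a real layer would add projections)."""
    B, H, W, C = x.shape
    gd = C // len(window_sizes)
    outs = []
    for g, ws in enumerate(window_sizes):
        xg = x[..., g * gd:(g + 1) * gd]
        win = window_partition(xg, ws)                       # (n * B, ws*ws, gd)
        attn = F.softmax(win @ win.transpose(1, 2) / gd ** 0.5, dim=-1)
        out = attn @ win                                      # per-window self-attention
        out = out.reshape(B, H // ws, W // ws, ws, ws, gd)    # undo the partition
        outs.append(out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, gd))
    return torch.cat(outs, dim=-1)                            # (B, H, W, C)

y = multi_size_window_attn(torch.randn(2, 16, 16, 64))        # -> (2, 16, 16, 64)
```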
arXiv Detail & Related papers (2022-03-24T05:38:07Z)
- Lawin Transformer: Improving Semantic Segmentation Transformer with
Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into semantic segmentation ViT via window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT), which progressively pools visual tokens to shrink the sequence length (see the sketch after this entry).
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
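As referenced in the entry above, here is a tiny sketch of the hierarchical pooling idea: between transformer blocks the token sequence is downsampled so that later blocks run on fewer tokens. The choice of max pooling with stride 2 is an assumption for illustration, not the HVT paper's exact operator.

```python
import torch
import torch.nn as nn

class TokenPool(nn.Module):
    """Shrinks a (B, N, C) token sequence by a fixed stride between blocks."""
    def __init__(self, stride=2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=stride, stride=stride)

    def forward(self, tokens):
        # Pool along the token dimension: (B, N, C) -> (B, N // stride, C).
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)

tokens = torch.randn(2, 196, 384)     # e.g. 14 x 14 = 196 patch tokens
print(TokenPool()(tokens).shape)      # torch.Size([2, 98, 384])
```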