Vision Transformer with Deformable Attention
- URL: http://arxiv.org/abs/2201.00520v1
- Date: Mon, 3 Jan 2022 08:29:01 GMT
- Title: Vision Transformer with Deformable Attention
- Authors: Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
- Abstract summary: A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
- Score: 29.935891419574602
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers have recently shown superior performance on various
vision tasks. The large, sometimes even global, receptive field endows
Transformer models with higher representation power than their CNN
counterparts. Nevertheless, simply enlarging the receptive field also raises
several concerns. On the one hand, using dense attention, e.g. in ViT, leads
to excessive memory and computational cost, and features can be influenced by
irrelevant parts that lie beyond the region of interest. On the other hand,
the sparse attention adopted in PVT or Swin Transformer is data-agnostic and
may limit the ability to model long-range relations. To mitigate these issues,
we propose a novel deformable self-attention module, where the positions of key
and value pairs in self-attention are selected in a data-dependent way. This
flexible scheme enables the self-attention module to focus on relevant regions
and capture more informative features. On this basis, we present Deformable
Attention Transformer, a general backbone model with deformable attention for
both image classification and dense prediction tasks. Extensive experiments
show that our models achieve consistently improved results on comprehensive
benchmarks. Code is available at https://github.com/LeapLabTHU/DAT.
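The core idea can be made concrete with a short sketch. The following is a minimal, self-contained PyTorch illustration of data-dependent key/value sampling in the spirit of the deformable attention described above; the module name DeformableAttentionSketch, the offset network, the sampling-grid size, and the single-group simplification are assumptions made for illustration, not the reference implementation released at https://github.com/LeapLabTHU/DAT.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Toy deformable attention: keys/values come from data-dependent points."""

    def __init__(self, dim, n_heads=4, n_sample=7):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.scale = self.head_dim ** -0.5
        self.n_sample = n_sample                    # sampling grid is n_sample x n_sample
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)
        # Lightweight network predicting 2-D offsets from the feature map.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q_proj(x.flatten(2).transpose(1, 2))           # (B, H*W, C)

        # Uniform reference points on a coarse grid in [-1, 1] coordinates.
        coords = torch.linspace(-1, 1, self.n_sample, device=x.device)
        ref = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), -1)
        ref = ref.flip(-1).unsqueeze(0).expand(B, -1, -1, -1)   # (B, n, n, 2) as (x, y)

        # Data-dependent offsets; deformed sampling positions = reference + offset.
        off = F.adaptive_avg_pool2d(self.offset_net(x), self.n_sample)  # (B, 2, n, n)
        pos = (ref + off.permute(0, 2, 3, 1)).clamp(-1, 1)              # (B, n, n, 2)

        # Bilinearly sample features at the deformed points; the sampled
        # features supply the keys and values.
        kv_feat = F.grid_sample(x, pos, align_corners=True)             # (B, C, n, n)
        k, v = self.kv_proj(kv_feat.flatten(2).transpose(1, 2)).chunk(2, dim=-1)

        # Standard multi-head attention between all queries and the small
        # set of sampled keys/values.
        def split(t):                                # (B, N, C) -> (B, heads, N, head_dim)
            return t.view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H * W, C)
        return self.out_proj(out).transpose(1, 2).reshape(B, C, H, W)
```
Because attention here runs over only n_sample x n_sample sampled positions instead of all H x W locations, the key/value set stays small while its placement adapts to the input.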
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
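As a loose, hypothetical illustration of this "attention score as feature map" idea, the sketch below refines the per-head score matrix with a depthwise convolution before the softmax; the module name ConvProcessedAttention, the residual placement, and all hyperparameters are assumptions rather than DAPE V2's actual design.
```python
import torch
import torch.nn as nn

class ConvProcessedAttention(nn.Module):
    """Self-attention whose score map is refined by a convolution."""

    def __init__(self, dim, n_heads=4, kernel_size=3):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv over the (N x N) score map, with heads as channels.
        self.score_conv = nn.Conv2d(n_heads, n_heads, kernel_size,
                                    padding=kernel_size // 2, groups=n_heads)

    def forward(self, x):                     # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                         # (B, N, C) -> (B, heads, N, head_dim)
            return t.view(B, N, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        scores = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        # Treat the score matrix as a feature map and refine it with a conv.
        scores = scores + self.score_conv(scores)
        out = scores.softmax(dim=-1) @ v                  # (B, heads, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```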
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [10.045205311757028]
Learning subtle representation about object parts plays a vital role in fine-grained visual recognition (FGVR) field.
With the fixed patch size in ViT, the class token in deep layers focuses on the global receptive field and cannot generate multi-granularity features for FGVR.
We propose a novel method named Adaptive attention multi-scale Fusion Transformer (AFTrans) to capture region attention without box annotations.
arXiv Detail & Related papers (2021-10-04T08:11:21Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
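A rough sketch of such a parallel convolution-plus-attention layer is given below; fusing the two branches by simple addition, and the specific convolution stack, are assumptions for illustration and differ from the actual ViTAE block.
```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Transformer layer with a convolution branch parallel to self-attention."""

    def __init__(self, dim, n_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Convolution branch running in parallel to multi-head self-attention.
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x, hw):                 # x: (B, N, C) tokens, hw = (H, W)
        H, W = hw
        B, N, C = x.shape
        z = self.norm1(x)
        attn_out, _ = self.attn(z, z, z)      # global multi-head self-attention
        conv_in = z.transpose(1, 2).reshape(B, C, H, W)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)
        x = x + attn_out + conv_out           # fuse the two parallel branches
        return x + self.ffn(self.norm2(x))    # fused features go through the FFN
```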
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.