DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
- URL: http://arxiv.org/abs/2309.01430v1
- Date: Mon, 4 Sep 2023 08:26:47 GMT
- Title: DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
- Authors: Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
- Abstract summary: We present Deformable Attention Transformer (DAT++), a vision backbone that is efficient and effective for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
- Score: 87.41016963608067
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers have shown superior performance on various vision tasks. Their
large receptive field endows Transformer models with higher representation
power than their CNN counterparts. Nevertheless, simply enlarging the receptive
field also raises several concerns. On the one hand, using dense attention in
ViT leads to excessive memory and computational cost, and features can be
influenced by irrelevant parts that are beyond the regions of interest. On the
other hand, the handcrafted attention adopted in PVT or Swin Transformer is
data agnostic and may limit the ability to model long-range relations. To solve
this dilemma, we propose a novel deformable multi-head attention module, where
the positions of key and value pairs in self-attention are adaptively allocated
in a data-dependent way. This flexible scheme enables the proposed deformable
attention to dynamically focus on relevant regions while maintaining the
representation power of global attention. On this basis, we present Deformable
Attention Transformer (DAT), a general vision backbone that is both efficient
and effective for visual recognition. We further build an enhanced version,
DAT++. Extensive
experiments show that our DAT++ achieves state-of-the-art results on various
visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0
MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
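To make the data-dependent key/value sampling described in the abstract concrete, below is a minimal PyTorch sketch of the idea: an offset network predicts shifts for a small set of uniform reference points, keys and values are bilinearly sampled at the shifted positions, and all dense queries attend to this sparse sampled set. The class name `DeformableAttentionSketch`, the pooled-query offset prediction, and the single group of sampling points are illustrative simplifications for this listing, not the authors' DAT/DAT++ implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttentionSketch(nn.Module):
    """Illustrative sketch: dense queries attend to keys/values sampled at
    data-dependent (deformed) positions on the feature map."""

    def __init__(self, dim, num_heads=8, n_points=16):
        super().__init__()
        assert dim % num_heads == 0 and int(math.isqrt(n_points)) ** 2 == n_points
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.n_points = n_points
        self.proj_q = nn.Linear(dim, dim)
        self.proj_kv = nn.Linear(dim, 2 * dim)
        self.proj_out = nn.Linear(dim, dim)
        # Hypothetical offset network: maps pooled query features to one
        # 2-D offset per sampling point (a simplification of the paper's design).
        self.offset_net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2 * n_points)
        )

    def forward(self, x, H, W):
        # x: (B, N, C) tokens of an H x W feature map, N = H * W.
        B, N, C = x.shape
        q = self.proj_q(x)

        # Uniform reference points in normalized [-1, 1] coordinates.
        g = int(math.isqrt(self.n_points))
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, g, device=x.device),
            torch.linspace(-1, 1, g, device=x.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1).view(1, self.n_points, 1, 2)

        # Data-dependent offsets shift the reference points.
        offsets = torch.tanh(self.offset_net(q.mean(dim=1)))        # (B, 2 * n_points)
        grid = (ref + offsets.view(B, self.n_points, 1, 2)).clamp(-1, 1)

        # Bilinearly sample features at the deformed positions -> sparse keys/values.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        sampled = F.grid_sample(feat, grid, align_corners=False)    # (B, C, n_points, 1)
        sampled = sampled.flatten(2).transpose(1, 2)                # (B, n_points, C)
        k, v = self.proj_kv(sampled).chunk(2, dim=-1)

        # Standard multi-head attention: every query attends to the sampled set.
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, self.n_points, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, self.n_points, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj_out(out)


if __name__ == "__main__":
    # Toy usage: a 14 x 14 feature map with 256 channels.
    x = torch.randn(2, 14 * 14, 256)
    y = DeformableAttentionSketch(dim=256)(x, H=14, W=14)
    print(y.shape)  # torch.Size([2, 196, 256])
```

In the paper's actual design, offsets are predicted per reference point from the query features by a small convolutional offset network, and a relative position bias is added to the attention logits; the sketch above only preserves the core pattern of dense queries attending to a small, data-dependently placed set of keys and values.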
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization scheme.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
However, these models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- TDAN: Top-Down Attention Networks for Enhanced Feature Selectivity in CNNs [18.24779045808196]
We propose a lightweight top-down (TD) attention module that iteratively generates a "visual searchlight" to perform top-down channel and spatial modulation of its inputs.
Our models are more robust to changes in input resolution during inference and learn to "shift attention" by localizing individual objects or features at each computation step without any explicit supervision.
arXiv Detail & Related papers (2021-11-26T12:35:17Z)
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [10.045205311757028]
Learning subtle representations of object parts plays a vital role in the fine-grained visual recognition (FGVR) field.
With the fixed patch size in ViT, the class token in deep layers focuses on the global receptive field and cannot generate multi-granularity features for FGVR.
We propose a novel method named Adaptive attention multi-scale Fusion Transformer (AFTrans) to capture region attention without box annotations.
arXiv Detail & Related papers (2021-10-04T08:11:21Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.