DaViT: Dual Attention Vision Transformers
- URL: http://arxiv.org/abs/2204.03645v1
- Date: Thu, 7 Apr 2022 17:59:32 GMT
- Title: DaViT: Dual Attention Vision Transformers
- Authors: Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan
- Abstract summary: We introduce Dual Attention Vision Transformers (DaViT).
DaViT is a vision transformer architecture that is able to capture global context while maintaining computational efficiency.
We show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations.
- Score: 94.62855697081079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce Dual Attention Vision Transformers (DaViT), a
simple yet effective vision transformer architecture that is able to capture
global context while maintaining computational efficiency. We propose
approaching the problem from an orthogonal angle: exploiting self-attention
mechanisms with both "spatial tokens" and "channel tokens". With spatial
tokens, the spatial dimension defines the token scope, and the channel
dimension defines the token feature dimension. With channel tokens, we have the
inverse: the channel dimension defines the token scope, and the spatial
dimension defines the token feature dimension. We further group tokens along
the sequence direction for both spatial and channel tokens to maintain the
linear complexity of the entire model. We show that these two self-attentions
complement each other: (i) since each channel token contains an abstract
representation of the entire image, the channel attention naturally captures
global interactions and representations by taking all spatial positions into
account when computing attention scores between channels; (ii) the spatial
attention refines the local representations by performing fine-grained
interactions across spatial locations, which in turn helps the global
information modeling in channel attention. Extensive experiments show our DaViT
achieves state-of-the-art performance on four different tasks with efficient
computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base
achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M,
49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with
1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1
accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
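To make the two token views concrete, the following is a minimal single-head sketch consistent with the abstract's description, assuming PyTorch; it is not the official implementation (see the repository linked above), and the window and group sizes are arbitrary.

```python
# Minimal single-head sketch of the two attention views described in the
# abstract (not the official implementation); window/group sizes are arbitrary.
import torch
import torch.nn.functional as F

def spatial_window_attention(x, window):
    # spatial tokens: scope = spatial positions, feature dim = channels (C)
    B, N, C = x.shape
    xw = x.view(B, N // window, window, C)                   # group along the sequence
    attn = F.softmax(xw @ xw.transpose(-2, -1) / C ** 0.5, dim=-1)
    return (attn @ xw).reshape(B, N, C)

def channel_group_attention(x, group):
    # channel tokens: scope = channels, feature dim = spatial positions (N)
    B, N, C = x.shape
    xg = x.transpose(1, 2).reshape(B, C // group, group, N)  # group along the channels
    attn = F.softmax(xg @ xg.transpose(-2, -1) / N ** 0.5, dim=-1)
    return (attn @ xg).reshape(B, C, N).transpose(1, 2)

x = torch.randn(2, 64, 96)                                    # B=2, 8x8 patches, 96 channels
y = channel_group_attention(spatial_window_attention(x, window=16), group=32)
print(y.shape)                                                # torch.Size([2, 64, 96])
```

Chaining the two calls here is only to show the tensor shapes; the point made in the abstract is that the channel view mixes information across all spatial positions at once, while the spatial view refines local detail within each group of positions.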
Related papers
- Fusion of regional and sparse attention in Vision Transformers [4.782322901897837]
Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions.
We propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information.
Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42%.
arXiv Detail & Related papers (2024-06-13T06:48:25Z)
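The entry above names Atrous Attention as a blend of regional and sparse attention. The sketch below shows one generic way to form sparse, dilated token groups on a 2D grid so that distant tokens interact; it is an illustration of the general idea with an arbitrary dilation rate, not the paper's exact formulation.

```python
# Generic dilated ("atrous") token grouping for attention: tokens spaced
# `rate` apart on the 2D grid attend to each other, giving sparse global
# interactions. Purely illustrative; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def atrous_attention(x, h, w, rate):
    # x: (B, h*w, C) flattened grid of tokens
    B, N, C = x.shape
    g = x.reshape(B, h // rate, rate, w // rate, rate, C)      # split grid by dilation
    g = g.permute(0, 2, 4, 1, 3, 5).reshape(B, rate * rate, -1, C)
    attn = F.softmax(g @ g.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = (attn @ g).reshape(B, rate, rate, h // rate, w // rate, C)
    return out.permute(0, 3, 1, 4, 2, 5).reshape(B, N, C)      # restore grid order

x = torch.randn(2, 14 * 14, 64)
print(atrous_attention(x, 14, 14, rate=2).shape)               # torch.Size([2, 196, 64])
```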
- Sub-token ViT Embedding via Stochastic Resonance Transformers [51.12001699637727]
Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch.
We propose a training-free method inspired by "stochastic resonance".
The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization.
arXiv Detail & Related papers (2023-10-06T01:53:27Z)
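The SRT summary above says the method is training-free and grounds the representation on a finer spatial scale. One plausible reading, sketched below under explicit assumptions, is to ensemble features computed from small sub-token translations of the input; the extractor `embed`, the shift set, and the bilinear upsampling are all illustrative assumptions, not the paper's exact recipe.

```python
# Rough, training-free sketch of ensembling ViT-style features over small
# input translations; all choices here are illustrative assumptions.
import torch
import torch.nn.functional as F

def stochastic_resonance_features(image, embed, shifts=(-4, 0, 4)):
    # image: (B, C, H, W); embed maps it to patch features (B, D, H/16, W/16) in this toy
    B, C, H, W = image.shape
    acc, n = 0.0, 0
    for dy in shifts:
        for dx in shifts:
            shifted = torch.roll(image, (dy, dx), dims=(2, 3))   # sub-token translation
            feat = embed(shifted)                                # coarse token features
            up = F.interpolate(feat, size=(H, W), mode="bilinear", align_corners=False)
            acc = acc + torch.roll(up, (-dy, -dx), dims=(2, 3))  # undo the shift in feature space
            n += 1
    return acc / n                                               # pixel-resolution ensemble

# toy extractor standing in for a real ViT backbone: (B, 3, H, W) -> (B, 9, H/16, W/16)
toy = lambda x: F.avg_pool2d(x.repeat(1, 3, 1, 1), 16)
print(stochastic_resonance_features(torch.randn(1, 3, 64, 64), toy).shape)
```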
- DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion [25.092756016673235]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision.
We propose a light-weight and efficient vision transformer model called DualToken-ViT.
arXiv Detail & Related papers (2023-09-21T18:46:32Z)
- Efficient Multi-Scale Attention Module with Cross-Spatial Learning [4.046170185945849]
A novel efficient multi-scale attention (EMA) module is proposed.
We focus on retaining per-channel information while decreasing the computational overhead.
We conduct extensive ablation studies and experiments on image classification and object detection tasks.
arXiv Detail & Related papers (2023-05-23T00:35:47Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves a series of state-of-the-art performance with a lightweight model, 85.68% on CIFAR-100 with 22.8M parameters, 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
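The PSViT entry names token pooling and attention sharing. The toy sketch below shows only the sharing part, reusing the attention map of one layer in the next so the query-key product is computed once; the single-head form and dimensions are illustrative assumptions, not the paper's design.

```python
# Toy illustration of sharing one attention map across adjacent layers;
# illustrative assumption, not the paper's actual block design.
import torch
import torch.nn.functional as F

def attention(x, wq, wk, wv, attn=None):
    # if `attn` is given, skip the query/key product and reuse it
    if attn is None:
        q, k = x @ wq, x @ wk
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ (x @ wv), attn

x = torch.randn(2, 196, 384)
w_q, w_k, w_v1, w_v2 = (torch.randn(384, 384) * 0.02 for _ in range(4))
y1, a = attention(x, w_q, w_k, w_v1)             # layer i computes the attention map
y2, _ = attention(y1, None, None, w_v2, attn=a)  # layer i+1 reuses it (shared attention)
print(y2.shape)                                   # torch.Size([2, 196, 384])
```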
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition [185.80889967154963]
We present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition.
By realizing the importance of the positional information carried by 2D feature representations, Vision Permutator encodes the feature representations along the height and width dimensions with linear projections.
We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers.
arXiv Detail & Related papers (2021-06-23T13:05:23Z)
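The Vision Permutator entry says features are encoded along the height and width dimensions with linear projections. The sketch below shows that idea in its plainest form, with separate linear mixes along H, along W, and along channels, followed by a fusion layer; the sizes and the fusion-by-concatenation are illustrative, and this is not the official Permute-MLP block.

```python
# Plain sketch of mixing features along height, width, and channel with
# linear projections; illustrative only, not the official Permute-MLP block.
import torch
import torch.nn as nn

class PermuteMLP(nn.Module):
    def __init__(self, h, w, c):
        super().__init__()
        self.proj_h = nn.Linear(h, h)   # mixes positions along the height axis
        self.proj_w = nn.Linear(w, w)   # mixes positions along the width axis
        self.proj_c = nn.Linear(c, c)   # mixes channels
        self.fuse = nn.Linear(3 * c, c)

    def forward(self, x):               # x: (B, H, W, C)
        xh = self.proj_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # project along H
        xw = self.proj_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # project along W
        xc = self.proj_c(x)                                          # project along C
        return self.fuse(torch.cat([xh, xw, xc], dim=-1))

x = torch.randn(2, 14, 14, 384)
print(PermuteMLP(14, 14, 384)(x).shape)   # torch.Size([2, 14, 14, 384])
```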
- Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.