DualToken-ViT: Position-aware Efficient Vision Transformer with Dual
Token Fusion
- URL: http://arxiv.org/abs/2309.12424v1
- Date: Thu, 21 Sep 2023 18:46:32 GMT
- Title: DualToken-ViT: Position-aware Efficient Vision Transformer with Dual
Token Fusion
- Authors: Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun
Huang, Weining Qian
- Abstract summary: Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision.
We propose a light-weight and efficient vision transformer model called DualToken-ViT.
- Score: 25.092756016673235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention-based vision transformers (ViTs) have emerged as a highly
competitive architecture in computer vision. Unlike convolutional neural
networks (CNNs), ViTs are capable of global information sharing. With the
development of various structures of ViTs, ViTs are increasingly advantageous
for many vision tasks. However, the quadratic complexity of self-attention
renders ViTs computationally intensive, and their lack of inductive biases of
locality and translation equivariance demands larger model sizes compared to
CNNs to effectively learn visual features. In this paper, we propose a
light-weight and efficient vision transformer model called DualToken-ViT that
leverages the advantages of CNNs and ViTs. DualToken-ViT achieves an efficient
attention structure by fusing the token carrying local information, obtained by a
convolution-based structure, with the token carrying global information, obtained
by a self-attention-based structure. In addition, we use position-aware global
tokens throughout all stages to enrich the global information, which further
strengthens the effect of DualToken-ViT. Position-aware global tokens also contain
the position information of the image, which makes our model better suited for
vision tasks. We conducted extensive experiments on image
classification, object detection and semantic segmentation tasks to demonstrate
the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of
different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G
FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T using
global tokens by 0.7%.
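As a rough picture of the fusion described in the abstract, the PyTorch-style sketch below combines a convolutional local branch with attention over a small set of learned global tokens. This is not the authors' implementation: the module name DualTokenBlock, the number of global tokens, and the concat-plus-linear fusion are illustrative assumptions, and how position information is injected into the global tokens is not shown.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    """Illustrative sketch only: fuses a convolutional (local) branch with a
    global-token attention branch, loosely following the abstract's description."""
    def __init__(self, dim: int, num_heads: int = 4, num_global_tokens: int = 8):
        super().__init__()
        # Local branch: depthwise convolution gathers neighborhood information.
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global branch: a small set of learned global tokens (assumed here to be
        # plain learnable parameters standing in for the position-aware tokens).
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fusion of the two token streams (assumed concat + linear projection).
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local_conv(x)                              # (B, C, H, W) local information
        tokens = x.flatten(2).transpose(1, 2)                   # (B, H*W, C) image tokens
        g = self.global_tokens.expand(b, -1, -1)                # (B, G, C) global tokens
        global_info, _ = self.attn(tokens, g, g)                # image tokens attend to global tokens
        global_info = global_info.transpose(1, 2).reshape(b, c, h, w)
        fused = torch.cat([local, global_info], dim=1)          # (B, 2C, H, W)
        fused = self.fuse(fused.flatten(2).transpose(1, 2))     # (B, H*W, C)
        return x + fused.transpose(1, 2).reshape(b, c, h, w)    # residual output

x = torch.randn(2, 64, 14, 14)
print(DualTokenBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```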
Related papers
- Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
Visual perception tasks are predominantly solved by ViTs. Despite their effectiveness, ViTs encounter a computational bottleneck due to the complexity of computing self-attention.
We propose the Fibottention architecture, which is built upon an approximation of self-attention.
arXiv Detail & Related papers (2024-06-27T17:59:40Z)
- FasterViT: Fast Vision Transformers with Hierarchical Attention [63.50580266223651]
We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications.
Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs (a rough illustrative sketch appears after this list).
arXiv Detail & Related papers (2023-06-09T18:41:37Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Grafting Vision Transformers [42.71480918208436]
Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks.
GrafT considers global dependencies and multi-scale information throughout the network.
It has the flexibility of branching out at arbitrary depths and shares most of the parameters and computations of the backbone.
arXiv Detail & Related papers (2022-10-28T07:07:13Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs (a loose sketch of the idea appears after this list).
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder (sketched after this list).
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
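For the FasterViT entry above, the Hierarchical Attention idea, attending within local windows and then only among per-window summary tokens so that full quadratic attention over all tokens is never computed, can be pictured as follows. The function name, the mean-pooled summaries, and the shared attention module are illustrative assumptions, not the paper's HAT implementation.

```python
import torch
import torch.nn as nn

def hierarchical_attention(x: torch.Tensor, window: int, attn: nn.MultiheadAttention):
    """Illustrative decomposition of global attention (not FasterViT's exact HAT):
    window-local attention plus attention over one summary token per window."""
    b, n, c = x.shape                                   # n must be divisible by `window`
    w = x.reshape(b * (n // window), window, c)
    local, _ = attn(w, w, w)                            # attention inside each window
    summaries = local.mean(dim=1).reshape(b, n // window, c)
    glob, _ = attn(summaries, summaries, summaries)     # cheap attention across window summaries
    glob = glob.repeat_interleave(window, dim=1)        # broadcast summaries back to their tokens
    return local.reshape(b, n, c) + glob

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out = hierarchical_attention(torch.randn(2, 196, 64), window=14, attn=attn)
print(out.shape)  # torch.Size([2, 196, 64])
```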
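The Token Slimming Module mentioned in the Self-slimmed Vision Transformer entry reduces the number of tokens the later blocks must process. A loose sketch of that idea, aggregating N tokens into a smaller set through a learned soft assignment, is below; the single linear scorer and the output token count are assumptions for illustration, not the paper's exact TSM.

```python
import torch
import torch.nn as nn

class TokenSlimming(nn.Module):
    """Illustrative token aggregation: N input tokens -> fewer output tokens."""
    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        self.score = nn.Linear(dim, num_out_tokens)     # assumed learned scorer

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) -> (B, num_out_tokens, C)
        weights = self.score(tokens).softmax(dim=1)     # soft assignment over input tokens
        return weights.transpose(1, 2) @ tokens         # each output token is a weighted mix

slim = TokenSlimming(384, num_out_tokens=98)
print(slim(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 98, 384])
```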
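For the discrete-representation entry, the input-layer change, attaching discrete tokens to the usual continuous patch embeddings, can be sketched as below. The tiny nearest-codeword lookup against a randomly initialized codebook stands in for the vector-quantized encoder named in the summary, and the additive combination is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class DiscreteTokenInput(nn.Module):
    """Illustrative stand-in for a vector-quantized encoder feeding ViT's input layer."""
    def __init__(self, dim: int, codebook_size: int = 1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)   # assumed learnable codebook

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (B, N, C) continuous patch tokens from the ViT stem.
        w = self.codebook.weight                            # (K, C) codewords
        scores = patch_embeddings @ w.t()                   # (B, N, K) dot products
        # Nearest codeword per patch via squared L2 distance (||a||^2 is constant per row).
        idx = (w.pow(2).sum(-1) - 2 * scores).argmin(dim=-1)
        discrete = self.codebook(idx)                       # (B, N, C) discrete tokens
        # Combine continuous and discrete representations before the transformer blocks.
        return patch_embeddings + discrete

tokens = torch.randn(2, 196, 768)
print(DiscreteTokenInput(768)(tokens).shape)  # torch.Size([2, 196, 768])
```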