CageViT: Convolutional Activation Guided Efficient Vision Transformer
- URL: http://arxiv.org/abs/2305.09924v1
- Date: Wed, 17 May 2023 03:19:18 GMT
- Title: CageViT: Convolutional Activation Guided Efficient Vision Transformer
- Authors: Hao Zheng, Jinbao Wang, Xiantong Zhen, Hong Chen, Jingkuan Song, Feng
Zheng
- Abstract summary: This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
- Score: 90.69578999760206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformers have emerged as the go-to architecture for both vision
and language modeling tasks, but their computational efficiency is limited by
the length of the input sequence. To address this, several efficient variants
of Transformers have been proposed to accelerate computation or reduce memory
consumption while preserving performance. This paper presents an efficient
vision Transformer, called CageViT, that is guided by convolutional activation
to reduce computation. Our CageViT, unlike current Transformers, utilizes a new
encoder to handle the rearranged tokens, bringing several technical
contributions: 1) Convolutional activation is used to pre-process the tokens
after patchifying the image to select and rearrange the major tokens and minor
tokens, which substantially reduces the computation cost through an additional
fusion layer. 2) Instead of using the class activation map of the convolutional
model directly, we design a new weighted class activation to lower the model
requirements. 3) To facilitate communication between major tokens and fusion
tokens, Gated Linear SRA is proposed to further integrate fusion tokens into
the attention mechanism. We perform a comprehensive validation of CageViT on
the image classification challenge.
Experimental results demonstrate that the proposed CageViT outperforms the
most recent state-of-the-art backbones by a large margin in terms of
efficiency, while maintaining a comparable level of accuracy (e.g., a
moderate-sized 43.35M model trained solely on 224 x 224 ImageNet-1K can achieve
a Top-1 accuracy of 83.4%).
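The abstract describes the pipeline only at a high level, so the following is a minimal sketch, assuming PyTorch and illustrative hyperparameters, of how activation-guided token selection with a fusion layer could be wired up: patch tokens are ranked by a convolution-derived activation score, the top-ranked tokens are kept as major tokens, and the remaining minor tokens are compressed into a handful of fusion tokens. The scoring rule, token counts, and fusion design below are assumptions; the paper's weighted class activation and Gated Linear SRA are not reproduced here.
```python
# Illustrative sketch (not the authors' code): activation-guided token selection
# with a fusion layer, loosely following the CageViT abstract. The scoring rule,
# the number of fusion tokens, and all hyperparameters here are assumptions.
import torch
import torch.nn as nn


class ActivationGuidedTokenizer(nn.Module):
    def __init__(self, dim=384, num_major=49, num_fusion=8):
        super().__init__()
        self.num_major = num_major
        self.num_fusion = num_fusion
        # Fusion layer: compresses all minor tokens into a few fusion tokens.
        self.fusion = nn.Linear(dim, dim)

    def forward(self, tokens, activation):
        # tokens:     (B, N, C) patch embeddings
        # activation: (B, N) per-patch importance from a small conv branch
        #             (standing in for the paper's weighted class activation,
        #             which is computed elsewhere).
        idx = activation.argsort(dim=1, descending=True)           # (B, N)
        order = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, N, C)
        ranked = torch.gather(tokens, 1, order)                    # rearranged tokens
        major = ranked[:, :self.num_major]                         # kept as-is
        minor = ranked[:, self.num_major:]                         # to be compressed
        # Group minor tokens into num_fusion chunks, average, and project:
        B, M, C = minor.shape
        pad = (-M) % self.num_fusion
        if pad:
            minor = torch.cat([minor, minor.new_zeros(B, pad, C)], dim=1)
        chunks = minor.view(B, self.num_fusion, -1, C).mean(dim=2)
        fusion = self.fusion(chunks)                                # (B, num_fusion, C)
        # The encoder then attends over far fewer tokens than the original N.
        return torch.cat([major, fusion], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)        # 14x14 patches from a 224x224 image
    act = torch.rand(2, 196)            # stand-in activation scores
    out = ActivationGuidedTokenizer()(x, act)
    print(out.shape)                    # torch.Size([2, 57, 384])
```
Attending over 57 tokens instead of 196 is where the computational saving would come from; per the abstract, Gated Linear SRA then governs how major and fusion tokens exchange information inside the encoder.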
Related papers
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
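The GTP-ViT entry above points to token pruning and merging as the general route to fewer tokens. As a rough illustration of that prune-and-propagate idea (not the actual graph-based propagation of GTP-ViT), the sketch below, assuming PyTorch and a stand-in importance score, drops the least important tokens and folds each one into its most similar kept token so its information is not discarded outright.
```python
# Minimal prune-and-propagate illustration of the token-reduction idea above.
# This is NOT the GTP-ViT algorithm; the importance score and the propagation
# rule (add each dropped token to its nearest kept token) are assumptions.
import torch
import torch.nn.functional as F


def prune_and_propagate(tokens, importance, keep_ratio=0.5):
    # tokens: (B, N, C), importance: (B, N)
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    order = importance.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :n_keep], order[:, n_keep:]
    gather = lambda idx: torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    kept, dropped = gather(keep_idx), gather(drop_idx)
    # Route each dropped token to its most similar kept token (cosine similarity).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    target = sim.argmax(dim=-1)                                    # (B, N - n_keep)
    out = kept.clone()
    out.scatter_add_(1, target.unsqueeze(-1).expand(-1, -1, C), dropped)
    return out                                                     # (B, n_keep, C)


tokens = torch.randn(2, 196, 384)
scores = tokens.norm(dim=-1)                       # stand-in importance scores
print(prune_and_propagate(tokens, scores).shape)   # torch.Size([2, 98, 384])
```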
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
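The MT-ViT entry only states that multiple tails emit visual sequences of different lengths; how an image is routed to a tail is not described. The sketch below, assuming PyTorch and arbitrary patch sizes, shows the basic mechanism: each tail is a patch embedding with a different patch size, so the same encoder can be fed a shorter (cheaper) or longer sequence.
```python
# Rough sketch of the "multiple tails" idea in the entry above: several patch
# embeddings with different patch sizes produce token sequences of different
# lengths for one shared Transformer encoder. Tail selection is left to the
# caller because it is not specified in the summary.
import torch
import torch.nn as nn


class MultiTailEmbedding(nn.Module):
    def __init__(self, dim=384, patch_sizes=(16, 32)):
        super().__init__()
        # One convolutional patch embedding ("tail") per patch size.
        self.tails = nn.ModuleList(
            [nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes]
        )

    def forward(self, images, tail_idx):
        feat = self.tails[tail_idx](images)          # (B, C, H/p, W/p)
        return feat.flatten(2).transpose(1, 2)       # (B, N, C), N depends on p


embed = MultiTailEmbedding()
x = torch.randn(1, 3, 224, 224)
print(embed(x, 0).shape)   # torch.Size([1, 196, 384]) -- 16x16 patches
print(embed(x, 1).shape)   # torch.Size([1, 49, 384])  -- 32x32 patches, cheaper
```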
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of visual transformers by excavating redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than quadratic.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
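The linear-complexity claim in the CrossViT entry follows from using a single class token as the query: with one query and N keys, attention cost grows with N rather than N^2. A minimal single-head sketch, assuming PyTorch and illustrative dimensions (not the authors' implementation), is given below.
```python
# Illustrative sketch of the cross-attention described above: the class token of
# one branch acts as the only query and attends to the patch tokens of the other
# branch, so cost is linear in the number of tokens (one query vs. N keys).
# The single-head form and all dimensions are assumptions.
import torch
import torch.nn as nn


class BranchCrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, cls_token, other_tokens):
        # cls_token:    (B, 1, C) class token from one branch
        # other_tokens: (B, N, C) patch tokens from the other branch
        q = self.q(cls_token)                                   # (B, 1, C)
        k, v = self.kv(other_tokens).chunk(2, dim=-1)           # (B, N, C) each
        attn = (q @ k.transpose(1, 2)) * self.scale             # (B, 1, N)
        attn = attn.softmax(dim=-1)
        return cls_token + attn @ v                             # fused class token


xattn = BranchCrossAttention()
cls_tok = torch.randn(2, 1, 256)
patches = torch.randn(2, 196, 256)
print(xattn(cls_tok, patches).shape)   # torch.Size([2, 1, 256])
```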