CageViT: Convolutional Activation Guided Efficient Vision Transformer
- URL: http://arxiv.org/abs/2305.09924v1
- Date: Wed, 17 May 2023 03:19:18 GMT
- Title: CageViT: Convolutional Activation Guided Efficient Vision Transformer
- Authors: Hao Zheng, Jinbao Wang, Xiantong Zhen, Hong Chen, Jingkuan Song, Feng
Zheng
- Abstract summary: This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
- Score: 90.69578999760206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformers have emerged as the go-to architecture for both vision
and language modeling tasks, but their computational efficiency is limited by
the length of the input sequence. To address this, several efficient variants
of Transformers have been proposed to accelerate computation or reduce memory
consumption while preserving performance. This paper presents an efficient
vision Transformer, called CageViT, that is guided by convolutional activation
to reduce computation. Our CageViT, unlike current Transformers, utilizes a new
encoder to handle the rearranged tokens, bringing several technical
contributions: 1) Convolutional activation is used to pre-process the tokens
after patchifying the image to select and rearrange the major tokens and minor
tokens, which substantially reduces the computation cost through an additional
fusion layer. 2) Instead of using the class activation map of the convolutional
model directly, we design a new weighted class activation to lower the model
requirements. 3) To facilitate communication between major tokens and fusion
tokens, Gated Linear SRA is proposed to further integrate fusion tokens into
the attention mechanism. We perform a comprehensive validation of CageViT on
the image classification challenge.
Experimental results demonstrate that the proposed CageViT outperforms the
most recent state-of-the-art backbones by a large margin in terms of
efficiency, while maintaining a comparable level of accuracy (e.g., a
moderate-sized 43.35M model trained solely on 224 x 224 ImageNet-1K can achieve
a Top-1 accuracy of 83.4%).
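The abstract describes the pipeline only at a high level, so the following is a minimal sketch, assuming PyTorch and illustrative hyperparameters, of how activation-guided token selection with a fusion layer could be wired up: patch tokens are ranked by a convolution-derived activation score, the top-ranked tokens are kept as major tokens, and the remaining minor tokens are compressed into a handful of fusion tokens. The scoring rule, token counts, and fusion design below are assumptions; the paper's weighted class activation and Gated Linear SRA are not reproduced here.
```python
# Illustrative sketch (not the authors' code): activation-guided token selection
# with a fusion layer, loosely following the CageViT abstract. The scoring rule,
# the number of fusion tokens, and all hyperparameters here are assumptions.
import torch
import torch.nn as nn


class ActivationGuidedTokenizer(nn.Module):
    def __init__(self, dim=384, num_major=49, num_fusion=8):
        super().__init__()
        self.num_major = num_major
        self.num_fusion = num_fusion
        # Fusion layer: compresses all minor tokens into a few fusion tokens.
        self.fusion = nn.Linear(dim, dim)

    def forward(self, tokens, activation):
        # tokens:     (B, N, C) patch embeddings
        # activation: (B, N) per-patch importance from a small conv branch
        #             (standing in for the paper's weighted class activation,
        #             which is computed elsewhere).
        idx = activation.argsort(dim=1, descending=True)           # (B, N)
        order = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, N, C)
        ranked = torch.gather(tokens, 1, order)                    # rearranged tokens
        major = ranked[:, :self.num_major]                         # kept as-is
        minor = ranked[:, self.num_major:]                         # to be compressed
        # Group minor tokens into num_fusion chunks, average, and project:
        B, M, C = minor.shape
        pad = (-M) % self.num_fusion
        if pad:
            minor = torch.cat([minor, minor.new_zeros(B, pad, C)], dim=1)
        chunks = minor.view(B, self.num_fusion, -1, C).mean(dim=2)
        fusion = self.fusion(chunks)                                # (B, num_fusion, C)
        # The encoder then attends over far fewer tokens than the original N.
        return torch.cat([major, fusion], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)        # 14x14 patches from a 224x224 image
    act = torch.rand(2, 196)            # stand-in activation scores
    out = ActivationGuidedTokenizer()(x, act)
    print(out.shape)                    # torch.Size([2, 57, 384])
```
Attending over 57 tokens instead of 196 is where the computational saving would come from; per the abstract, Gated Linear SRA then governs how major and fusion tokens exchange information inside the encoder.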
Related papers
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
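The GTP-ViT entry above points to token pruning and merging as the general route to fewer tokens. As a rough illustration of that prune-and-propagate idea (not the actual graph-based propagation of GTP-ViT), the sketch below, assuming PyTorch and a stand-in importance score, drops the least important tokens and folds each one into its most similar kept token so its information is not discarded outright.
```python
# Minimal prune-and-propagate illustration of the token-reduction idea above.
# This is NOT the GTP-ViT algorithm; the importance score and the propagation
# rule (add each dropped token to its nearest kept token) are assumptions.
import torch
import torch.nn.functional as F


def prune_and_propagate(tokens, importance, keep_ratio=0.5):
    # tokens: (B, N, C), importance: (B, N)
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    order = importance.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :n_keep], order[:, n_keep:]
    gather = lambda idx: torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    kept, dropped = gather(keep_idx), gather(drop_idx)
    # Route each dropped token to its most similar kept token (cosine similarity).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    target = sim.argmax(dim=-1)                                    # (B, N - n_keep)
    out = kept.clone()
    out.scatter_add_(1, target.unsqueeze(-1).expand(-1, -1, C), dropped)
    return out                                                     # (B, n_keep, C)


tokens = torch.randn(2, 196, 384)
scores = tokens.norm(dim=-1)                       # stand-in importance scores
print(prune_and_propagate(tokens, scores).shape)   # torch.Size([2, 98, 384])
```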
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
In this paper, we propose a Multi-Tailed Vision Transformer (MT-ViT).
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z)
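The MT-ViT entry only states that multiple tails emit visual sequences of different lengths; how an image is routed to a tail is not described. The sketch below, assuming PyTorch and arbitrary patch sizes, shows the basic mechanism: each tail is a patch embedding with a different patch size, so the same encoder can be fed a shorter (cheaper) or longer sequence.
```python
# Rough sketch of the "multiple tails" idea in the entry above: several patch
# embeddings with different patch sizes produce token sequences of different
# lengths for one shared Transformer encoder. Tail selection is left to the
# caller because it is not specified in the summary.
import torch
import torch.nn as nn


class MultiTailEmbedding(nn.Module):
    def __init__(self, dim=384, patch_sizes=(16, 32)):
        super().__init__()
        # One convolutional patch embedding ("tail") per patch size.
        self.tails = nn.ModuleList(
            [nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes]
        )

    def forward(self, images, tail_idx):
        feat = self.tails[tail_idx](images)          # (B, C, H/p, W/p)
        return feat.flatten(2).transpose(1, 2)       # (B, N, C), N depends on p


embed = MultiTailEmbedding()
x = torch.randn(1, 3, 224, 224)
print(embed(x, 0).shape)   # torch.Size([1, 196, 384]) -- 16x16 patches
print(embed(x, 1).shape)   # torch.Size([1, 49, 384])  -- 32x32 patches, cheaper
```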
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of visual transformers by excavating redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than quadratic.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
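The linear-complexity claim in the CrossViT entry follows from using a single class token as the query: with one query and N keys, attention cost grows with N rather than N^2. A minimal single-head sketch, assuming PyTorch and illustrative dimensions (not the authors' implementation), is given below.
```python
# Illustrative sketch of the cross-attention described above: the class token of
# one branch acts as the only query and attends to the patch tokens of the other
# branch, so cost is linear in the number of tokens (one query vs. N keys).
# The single-head form and all dimensions are assumptions.
import torch
import torch.nn as nn


class BranchCrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, cls_token, other_tokens):
        # cls_token:    (B, 1, C) class token from one branch
        # other_tokens: (B, N, C) patch tokens from the other branch
        q = self.q(cls_token)                                   # (B, 1, C)
        k, v = self.kv(other_tokens).chunk(2, dim=-1)           # (B, N, C) each
        attn = (q @ k.transpose(1, 2)) * self.scale             # (B, 1, N)
        attn = attn.softmax(dim=-1)
        return cls_token + attn @ v                             # fused class token


xattn = BranchCrossAttention()
cls_tok = torch.randn(2, 1, 256)
patches = torch.randn(2, 196, 256)
print(xattn(cls_tok, patches).shape)   # torch.Size([2, 1, 256])
```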