Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
- URL: http://arxiv.org/abs/2108.01390v2
- Date: Wed, 4 Aug 2021 13:15:31 GMT
- Title: Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
- Authors: Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming
Dong, Liqing Zhang, Changsheng Xu, Xing Sun
- Abstract summary: We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
- Score: 63.99222215387881
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Vision transformers have recently received explosive popularity, but the huge
computational cost is still a severe issue. Recent efficient designs for vision
transformers follow two pipelines, namely, structural compression based on
local spatial prior and non-structural token pruning. However, token pruning
breaks the spatial structure that is indispensable for local spatial prior. To
take advantage of both pipelines, this work seeks to dynamically identify
uninformative tokens for each instance and trim down both the training and
inference complexity while maintaining complete spatial structure and
information flow. To achieve this goal, we propose Evo-ViT, a self-motivated
slow-fast token evolution method for vision transformers. Specifically, we
conduct unstructured instance-wise token selection by taking advantage of the
global class attention that is unique to vision transformers. Then, we propose
to update informative tokens and placeholder tokens that contribute little to
the final prediction with different computational priorities, namely, slow-fast
updating. Thanks to the slow-fast updating mechanism that guarantees
information flow and spatial structure, our Evo-ViT can accelerate vanilla
transformers of both flat and deep-narrow structures from the very beginning of
the training process. Experimental results demonstrate that the proposed method
can significantly reduce the computational costs of vision transformers while
maintaining comparable performance on image classification. For example, our
method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4%
top-1 accuracy.
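The abstract gives no implementation details, but the two steps it describes (class-attention-based token selection and slow-fast updating) can be illustrated with a short PyTorch sketch. Everything below is an assumption made for illustration: the function name, the single aggregated summary token used for the fast path, and the stand-in `nn.TransformerEncoderLayer` are not taken from the authors' code.

```python
import torch
import torch.nn as nn


def slow_fast_token_evolution(tokens, cls_attn, block, keep_ratio=0.5):
    """Illustrative slow-fast update for one transformer layer (not the authors' code).

    tokens:   (B, 1 + N, D) -- class token followed by N patch tokens
    cls_attn: (B, N)        -- attention of the class token to each patch token,
                               taken from the previous layer ("global class attention")
    block:    any callable transformer block, applied only to the informative tokens
    keep_ratio is assumed to satisfy 0 < keep_ratio < 1.
    """
    B, n_plus_1, D = tokens.shape
    N = n_plus_1 - 1
    k = max(1, int(N * keep_ratio))

    cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]

    # 1) Instance-wise token selection: rank patch tokens by global class attention.
    idx = cls_attn.topk(k, dim=1).indices
    keep_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    keep_mask.scatter_(1, idx, True)
    informative = patch_tok[keep_mask].view(B, k, D)        # slow path
    placeholder = patch_tok[~keep_mask].view(B, N - k, D)   # fast path

    # 2) Slow update: the full block runs only on the class token, the informative
    #    tokens, and one aggregated summary of the placeholder tokens.
    summary = placeholder.mean(dim=1, keepdim=True)
    slow_out = block(torch.cat([cls_tok, informative, summary], dim=1))
    cls_tok, informative = slow_out[:, :1], slow_out[:, 1:1 + k]

    # 3) Fast update: placeholder tokens are refreshed cheaply by broadcasting the
    #    change of the summary token, so no token (and no spatial position) is discarded.
    placeholder = placeholder + (slow_out[:, -1:] - summary)

    # 4) Scatter everything back to the original spatial layout.
    out = patch_tok.clone()
    out[keep_mask] = informative.reshape(-1, D)
    out[~keep_mask] = placeholder.reshape(-1, D)
    return torch.cat([cls_tok, out], dim=1)


# Toy usage: a standard encoder layer stands in for a DeiT block, and random scores
# stand in for the class-attention map of the previous layer.
layer = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
x = torch.randn(2, 1 + 196, 192)     # 14x14 patches + class token, DeiT-T-like width
attn = torch.rand(2, 196)
print(slow_fast_token_evolution(x, attn, layer).shape)   # torch.Size([2, 197, 192])
```

Because the placeholder tokens are updated in place rather than dropped, the full token grid survives every layer, which is what lets the method keep the spatial structure that structural-compression designs rely on.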
Related papers
- Dynamic Token-Pass Transformers for Semantic Segmentation [22.673910995773262]
We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually stops easy tokens from entering the self-attention calculation and keeps forwarding the hard tokens until the stopping criterion is met.
Our method reduces FLOPs by about 40%-60%, and the drop in mIoU is within 0.8% for various segmentation transformers.
arXiv Detail & Related papers (2023-08-03T06:14:24Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Unlike current Transformers, CageViT uses a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is incorporated into the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z) - Dynamic Spatial Sparsification for Efficient Vision Transformers and
Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input (a rough sketch of this style of score-based pruning follows the related-papers list).
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
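Several entries above (DoViT, DynamicViT, and the dynamic spatial sparsification work) follow the token-pruning pipeline that the Evo-ViT abstract contrasts with. As a rough, hypothetical illustration of that family, the sketch below scores each patch token with a small MLP and keeps only the top-scoring ones; the module name and scorer design are assumptions, and only the inference-time pruning step is shown (DynamicViT trains such prediction modules end-to-end with a differentiable masking scheme).

```python
import torch
import torch.nn as nn


class TokenScorer(nn.Module):
    """Hypothetical scorer for input-dependent token pruning (names are illustrative)."""

    def __init__(self, dim):
        super().__init__()
        # Small MLP predicting one keep-score per patch token.
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens, keep_ratio=0.7):
        # tokens: (B, 1 + N, D); the class token is never pruned.
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        scores = self.mlp(patch_tok).squeeze(-1)                  # (B, N) keep scores
        k = max(1, int(patch_tok.shape[1] * keep_ratio))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values    # kept indices, spatial order
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tok.shape[-1])
        kept = patch_tok.gather(1, idx)                           # (B, k, D)
        return torch.cat([cls_tok, kept], dim=1)                  # shorter sequence


# Toy usage: drop roughly 30% of 196 patch tokens before the next block.
x = torch.randn(2, 197, 192)
print(TokenScorer(192)(x).shape)    # torch.Size([2, 138, 192])
```

Unlike the slow-fast sketch above, the pruned tokens are discarded for all later layers, which is exactly the loss of spatial structure that Evo-ViT's placeholder tokens are meant to avoid.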
This list is automatically generated from the titles and abstracts of the papers on this site.