DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification
- URL: http://arxiv.org/abs/2106.02034v1
- Date: Thu, 3 Jun 2021 17:57:41 GMT
- Title: DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification
- Authors: Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui
Hsieh
- Abstract summary: We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%~37% FLOPs and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
- Score: 134.9393799043401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention is sparse in vision transformers. We observe the final prediction
in vision transformers is only based on a subset of most informative tokens,
which is sufficient for accurate image recognition. Based on this observation,
we propose a dynamic token sparsification framework to prune redundant tokens
progressively and dynamically based on the input. Specifically, we devise a
lightweight prediction module to estimate the importance score of each token
given the current features. The module is added to different layers to prune
redundant tokens hierarchically. To optimize the prediction module in an
end-to-end manner, we propose an attention masking strategy to differentiably
prune a token by blocking its interactions with other tokens. Benefiting from
the nature of self-attention, the unstructured sparse tokens are still hardware
friendly, which makes our framework easy to achieve actual speed-up. By
hierarchically pruning 66% of the input tokens, our method greatly reduces
31%~37% FLOPs and improves the throughput by over 40% while the drop of
accuracy is within 0.5% for various vision transformers. Equipped with the
dynamic token sparsification framework, DynamicViT models can achieve very
competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs
and vision transformers on ImageNet. Code is available at
https://github.com/raoyongming/DynamicViT
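
The two mechanisms described in the abstract, a lightweight module that scores each token's importance and an attention mask that blocks a pruned token's interactions so the pruning decision stays differentiable, can be sketched roughly as follows. This is a minimal illustration with hypothetical names (TokenScorer, masked_attention), not the authors' implementation; the actual code is in the repository linked above.

```python
import torch
import torch.nn as nn


class TokenScorer(nn.Module):
    """Lightweight predictor of a per-token keep probability (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 2),  # logits for [drop, keep]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token features -> (B, N) keep probability per token
        return self.mlp(x).softmax(dim=-1)[..., 1]


def masked_attention(q, k, v, keep_mask, scale):
    """Self-attention in which dropped tokens (keep_mask == 0) cannot be
    attended to, so their influence is removed in a differentiable way."""
    # q, k, v: (B, H, N, D); keep_mask: (B, N) with values in {0, 1}
    attn = (q @ k.transpose(-2, -1)) * scale             # (B, H, N, N)
    attn = attn - 1e9 * (1.0 - keep_mask)[:, None, None, :]
    attn = attn.softmax(dim=-1)
    return attn @ v


if __name__ == "__main__":
    B, H, N, D = 2, 4, 197, 64
    x = torch.randn(B, N, H * D)
    keep_prob = TokenScorer(H * D)(x)                    # (B, N)
    # During training the paper samples a hard mask end-to-end (e.g. via
    # Gumbel-Softmax); thresholding here is only for illustration.
    keep_mask = (keep_prob > 0.5).float()
    q = k = v = x.view(B, N, H, D).transpose(1, 2)       # (B, H, N, D)
    out = masked_attention(q, k, v, keep_mask, scale=D ** -0.5)
    print(out.shape)                                     # torch.Size([2, 4, 197, 64])
```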
Related papers
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Bridging The Gaps Between Token Pruning and Full Pre-training via Masked
Fine-tuning [19.391064062033436]
Dynamic vision transformers accelerate inference by pruning redundant tokens.
Current base models usually adopt full-image training, using full images as inputs and keeping the whole feature map throughout the forward process.
Inspired by MAE, which performs a self-supervised masking-and-reconstruction task, we devise masked fine-tuning to bridge the gap between pre-trained base models and token-pruning-based dynamic vision transformers.
arXiv Detail & Related papers (2023-10-26T06:03:18Z) - Dynamic Token-Pass Transformers for Semantic Segmentation [22.673910995773262]
We introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation.
DoViT gradually excludes easy tokens from the self-attention calculation and keeps forwarding the hard tokens until a stopping criterion is met.
Our method greatly reduces FLOPs by about 40%~60%, while the drop in mIoU is within 0.8% for various segmentation transformers.
arXiv Detail & Related papers (2023-08-03T06:14:24Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - Dynamic Spatial Sparsification for Efficient Vision Transformers and
Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z) - Attribute Surrogates Learning and Spectral Tokens Pooling in
Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by a large margin of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z) - Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-08-03T09:56:07Z) - Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with
Adaptive Sequence Length [40.35853878334764]
Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition.
To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16.
We propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image; a rough sketch of this early-exit idea follows the list.
arXiv Detail & Related papers (2021-05-31T16:04:10Z)
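
The adaptive-sequence-length idea in the last entry can be illustrated with a small early-exit loop: run cheap models on few tokens first and fall through to larger ones only when the prediction is not confident enough. The cascade structure, threshold, and names (adaptive_inference, token_grids) below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def adaptive_inference(image, cascade, token_grids=(7, 10, 14), threshold=0.9):
    """Classify with the cheapest model whose confidence clears `threshold`.

    cascade[i] is assumed to be a ViT-style classifier whose patch embedding
    (patch size 16) turns its input into a token_grids[i] x token_grids[i]
    token grid and returns class logits of shape (B, num_classes).
    """
    logits = None
    for model, grid in zip(cascade, token_grids):
        # Resize so the model sees grid x grid tokens (fewer tokens = cheaper).
        resized = F.interpolate(image, size=(grid * 16, grid * 16),
                                mode="bilinear", align_corners=False)
        logits = model(resized)
        confidence = logits.softmax(dim=-1).max(dim=-1).values
        if bool((confidence >= threshold).all()):
            break  # confident enough: skip the larger, more expensive models
    return logits
```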