No Token Left Behind: Efficient Vision Transformer via Dynamic Token
Idling
- URL: http://arxiv.org/abs/2310.05654v2
- Date: Sun, 5 Nov 2023 07:33:21 GMT
- Title: No Token Left Behind: Efficient Vision Transformer via Dynamic Token
Idling
- Authors: Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen
Wang
- Abstract summary: Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
- Score: 55.203866875294516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have demonstrated outstanding performance in
computer vision tasks, yet their high computational complexity prevents their
deployment in resource-constrained environments. Various token
pruning techniques have been introduced to alleviate the high computational
burden of ViTs by dynamically dropping image tokens. However, some undesirable
pruning at early stages may result in permanent loss of image information in
subsequent layers, consequently hindering model performance. To address this
problem, we propose IdleViT, a dynamic token-idle-based method that achieves an
excellent trade-off between performance and efficiency. Specifically, in each
layer, IdleViT selects a subset of the image tokens to participate in
computations while keeping the rest of the tokens idle and directly passing
them to this layer's output. By allowing the idle tokens to be re-selected in
the following layers, IdleViT mitigates the negative impact of improper pruning
in the early stages. Furthermore, inspired by the normalized graph cut, we
devise a token cut loss on the attention map as regularization to improve
IdleViT's token selection ability. Our method is simple yet effective and can
be extended to pyramid ViTs since no token is completely dropped. Extensive
experimental results on various ViT architectures have shown that IdleViT can
diminish the complexity of pretrained ViTs by up to 33% with no more than a
0.2% accuracy decrease on ImageNet after fine-tuning for only 30 epochs.
Notably, at a keep ratio of 0.5, IdleViT outperforms the state-of-the-art
EViT on DeiT-S by 0.5% in accuracy while running at an even faster inference speed. The
source code is available in the supplementary material.
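To make the per-layer mechanism concrete, the sketch below illustrates the idea in PyTorch: in each layer only the [CLS] token and a top-k subset of image tokens are processed, the remaining tokens are copied to the output unchanged so they can be re-selected later, and a normalized-cut-style regularizer penalizes attention flowing between the kept and idle partitions. The class and function names, the externally supplied token scores, and the exact form of the loss are assumptions for illustration rather than the paper's released code.

```python
# Minimal sketch of dynamic token idling. Names (IdleTokenLayer, token_cut_loss),
# the external per-token score, and the loss formula are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class IdleTokenLayer(nn.Module):
    """Wraps a standard transformer block: only the [CLS] token and the top-k
    image tokens are processed; the rest stay idle and are copied straight to
    the layer output, so they remain available for re-selection later."""

    def __init__(self, block: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.block = block            # any (B, N, C) -> (B, N, C) transformer block
        self.keep_ratio = keep_ratio  # fraction of image tokens kept active per layer

    def forward(self, x: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with token 0 as [CLS]; scores: (B, N-1) importance of image tokens
        B, N, C = x.shape
        k = max(1, int((N - 1) * self.keep_ratio))

        # Select the k highest-scoring image tokens for this layer only.
        topk = scores.topk(k, dim=1).indices + 1                      # shift past [CLS]
        cls_idx = torch.zeros(B, 1, dtype=torch.long, device=x.device)
        keep_idx = torch.cat([cls_idx, topk], dim=1)                  # (B, k + 1)

        selected = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
        processed = self.block(selected)                              # attention + MLP on kept tokens only

        # Idle tokens pass through unchanged; kept tokens are written back,
        # so no token is dropped and early selection mistakes can be corrected later.
        out = x.clone()
        out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), processed)
        return out


def token_cut_loss(attn: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Normalized-cut-style regularizer on a head-averaged attention map.
    It penalizes attention mass crossing between the kept and idle partitions,
    normalized by each partition's total association. This is an assumed form
    inspired by the normalized graph cut, not the paper's exact loss."""
    # attn: (B, N, N); keep_mask: (B, N) with 1 for kept tokens, 0 for idle ones
    m = keep_mask.float()
    cut = (attn * m.unsqueeze(2) * (1.0 - m).unsqueeze(1)).sum(dim=(1, 2))
    assoc_keep = (attn * m.unsqueeze(2)).sum(dim=(1, 2)).clamp(min=1e-6)
    assoc_idle = (attn * (1.0 - m).unsqueeze(2)).sum(dim=(1, 2)).clamp(min=1e-6)
    return (cut / assoc_keep + cut / assoc_idle).mean()
```

In practice, such a layer could be driven by scores derived from the previous block's [CLS] attention column, with the cut loss added to the training objective alongside the classification loss; how IdleViT actually computes its scores and loss is specified in the paper itself.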
Related papers
- VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation [18.9885501527331]
Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance.
Image token pruning is one of the most effective strategies to address their high computational complexity.
This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models.
arXiv Detail & Related papers (2024-09-13T01:30:24Z) - GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Bridging The Gaps Between Token Pruning and Full Pre-training via Masked
Fine-tuning [19.391064062033436]
Dynamic vision transformers accelerate inference by pruning redundant tokens.
Current base models usually adopt full-image training, using full images as inputs and keeping the whole feature map throughout the forward process.
Inspired by MAE, which performs a self-supervised masking-and-reconstruction task, we devise masked fine-tuning to bridge the gap between pre-trained base models and token-pruning-based dynamic vision transformers.
arXiv Detail & Related papers (2023-10-26T06:03:18Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become the popular structures and outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-to-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.