GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation
- URL: http://arxiv.org/abs/2311.03035v2
- Date: Mon, 8 Jan 2024 03:42:25 GMT
- Title: GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation
- Authors: Xuwei Xu, Sen Wang, Yudong Chen, Yanping Zheng, Zhewei Wei, Jiajun Liu
- Abstract summary: Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim to reduce the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
- Score: 30.343504537684755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have revolutionized the field of computer vision,
yet their deployment on resource-constrained devices remains challenging due to
high computational demands. To expedite pre-trained ViTs, token pruning and
token merging approaches have been developed, which aim to reduce the number
of tokens involved in the computation. However, these methods still have some
limitations, such as image information loss from pruned tokens and inefficiency
in the token-matching process. In this paper, we introduce a novel Graph-based
Token Propagation (GTP) method to resolve the challenge of balancing model
efficiency and information preservation for efficient ViTs. Inspired by graph
summarization algorithms, GTP meticulously propagates less significant tokens'
information to spatially and semantically connected tokens that are of greater
importance. Consequently, the remaining few tokens serve as a summarization of
the entire token graph, allowing the method to reduce computational complexity
while preserving essential information of eliminated tokens. Combined with an
innovative token selection strategy, GTP can efficiently identify image tokens
to be propagated. Extensive experiments have validated GTP's effectiveness,
demonstrating both efficiency and performance improvements. Specifically, GTP
decreases the computational complexity of both DeiT-S and DeiT-B by up to 26%
with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and
remarkably surpasses the state-of-the-art token merging method on various
backbones at an even faster inference speed. The source code is available at
https://github.com/Ackesnal/GTP-ViT.
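To make the propagation step concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: using the head-averaged attention map as the token graph, scoring importance by attention received, and the mixing coefficient alpha are all simplifying assumptions made here for illustration.

```python
import torch

def propagate_tokens(x, attn, keep_ratio=0.75, alpha=0.5):
    """Illustrative graph-based token propagation (not the official GTP code).

    x:     (B, N, D) patch-token embeddings (the CLS token is assumed to be
           handled separately)
    attn:  (B, N, N) head-averaged attention map, reused both as an importance
           signal and as the edge weights of the token graph
    alpha: mixing coefficient for propagated information (an assumption here)
    """
    B, N, D = x.shape
    n_keep = max(1, int(N * keep_ratio))

    # A token is treated as important if it receives much attention from others.
    importance = attn.mean(dim=1)                                   # (B, N)
    order = importance.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :n_keep], order[:, n_keep:]

    b = torch.arange(B, device=x.device).view(B, 1)
    kept, dropped = x[b, keep_idx], x[b, drop_idx]

    # Edges run from each eliminated token to the kept tokens it attends to,
    # renormalised so every eliminated token distributes all of its information.
    w = attn[b.unsqueeze(-1), drop_idx.unsqueeze(-1), keep_idx.unsqueeze(1)]
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-6)             # (B, N-n_keep, n_keep)

    # Each kept token absorbs a weighted sum of the eliminated tokens, so the
    # surviving tokens summarise the whole token graph.
    kept = kept + alpha * torch.einsum("bdk,bde->bke", w, dropped)
    return kept, keep_idx
```

Run once between a block's attention and MLP, such a step shrinks the sequence that later layers process while folding the eliminated tokens' features into their surviving neighbours, which is the information-preservation property the abstract describes.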
Related papers
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes learnable meta tokens that form a sparse token set, effectively capturing key information and improving inference speed.
By employing Dual Cross-Attention (DCA) between the meta tokens and dense visual tokens in the early stages, we obtain the hierarchical architecture LeMeViT in various sizes.
Experimental results on classification and dense prediction tasks show that LeMeViT achieves a significant $1.7\times$ speedup, with fewer parameters and competitive performance compared to the baseline models.
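The dual cross-attention idea can be pictured as two inexpensive cross-attention passes between a handful of learnable meta tokens and the dense image tokens. The sketch below is a rough reading of that idea; the layer sizes and the omission of residual connections and normalisation are simplifications, not LeMeViT's actual configuration.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of dual cross-attention: a small set of learnable meta tokens
    exchanges information with the dense image tokens in both directions.
    Dimensions here are illustrative assumptions, not LeMeViT's real config."""

    def __init__(self, dim=384, num_meta=16, num_heads=6):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)
        self.meta_from_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.tokens_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, D)
        meta = self.meta.expand(tokens.size(0), -1, -1)
        # 1) Meta tokens summarise the dense tokens (cheap: few queries).
        meta, _ = self.meta_from_tokens(meta, tokens, tokens)
        # 2) Dense tokens read the summary back (cheap: few keys/values).
        tokens, _ = self.tokens_from_meta(tokens, meta, meta)
        return tokens, meta
```

Both passes cost O(N · num_meta) rather than O(N²), which is where the speedup over full self-attention on dense tokens would come from.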
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution to different regions of the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
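Conceptually, idling routes only a scored subset of tokens through each block while the rest bypass it unchanged and remain available to later layers. The sketch below illustrates one such step; scoring tokens by received attention is an assumption made here, not necessarily IdleViT's selection rule.

```python
import torch

def idle_layer(x, block, attn_scores, active_ratio=0.7):
    """One token-idling step (illustrative sketch, not the official IdleViT).

    x:           (B, N, D) tokens
    block:       any ViT block mapping (B, M, D) -> (B, M, D)
    attn_scores: (B, N) per-token importance, e.g. attention received
    """
    B, N, D = x.shape
    n_active = max(1, int(N * active_ratio))
    idx = attn_scores.topk(n_active, dim=1).indices   # tokens that stay active
    b = torch.arange(B, device=x.device).unsqueeze(1)

    out = x.clone()                 # idle tokens pass through untouched...
    out[b, idx] = block(x[b, idx])  # ...only active tokens are computed on
    return out
```

Because idle tokens are carried along rather than deleted, a token that looks unimportant at one depth can still be selected again later, which is how this family of methods avoids permanent information loss.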
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
It reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
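A parameter-free pruning-plus-pooling step might look like the sketch below. The split sizes, cosine-similarity matching, and averaging rule are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prune_and_pool(x, importance, n_prune, n_pool):
    """Illustrative token pruning + pooling (not the official PPT code).

    x:          (B, N, D) tokens
    importance: (B, N) scores, e.g. CLS attention
    The n_prune least important tokens are dropped outright; the next n_pool
    tokens are averaged into their most similar surviving token.
    """
    B, N, D = x.shape
    order = importance.argsort(dim=1)              # ascending importance;
                                                   # order[:, :n_prune] is pruned
    pool_idx = order[:, n_prune:n_prune + n_pool]  # merged, not discarded
    keep_idx = order[:, n_prune + n_pool:]         # survive unchanged
    b = torch.arange(B, device=x.device).unsqueeze(1)
    kept, pooled = x[b, keep_idx], x[b, pool_idx]

    # Pool each candidate token into its most cosine-similar kept token.
    sim = F.normalize(pooled, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    target = sim.argmax(dim=-1)                    # (B, n_pool)

    sums = kept.clone()
    sums.scatter_add_(1, target.unsqueeze(-1).expand(-1, -1, D), pooled)
    counts = torch.ones(B, keep_idx.size(1), 1, device=x.device, dtype=x.dtype)
    counts.scatter_add_(1, target.unsqueeze(-1),
                        torch.ones(B, target.size(1), 1, device=x.device, dtype=x.dtype))
    return sums / counts                           # merged token set
```

Nothing in this step is learned, which matches the blurb's "without additional trainable parameters": it only needs scores and similarities the model already produces.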
arXiv Detail & Related papers (2023-10-03T05:55:11Z)
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it suitable for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Localized Contrastive Learning on Graphs [110.54606263711385]
We introduce a simple yet effective contrastive model named Localized Graph Contrastive Learning (Local-GCL).
In spite of its simplicity, Local-GCL achieves quite competitive performance in self-supervised node representation learning tasks on graphs with various scales and properties.
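The core construction, first-order neighbours as positives, fits in a few lines. The sketch below uses the full InfoNCE denominator for clarity; part of Local-GCL's contribution is keeping such a loss scalable, so treat this as a conceptual rendering rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(z, edge_index, tau=0.5):
    """Neighbour-as-positive contrastive loss on a graph (conceptual sketch).

    z:          (N, D) node embeddings
    edge_index: (2, E) COO edge list; each edge supplies a positive pair
    """
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                      # (N, N) cosine similarities
    log_den = torch.logsumexp(sim, dim=-1)     # normaliser over all candidates
    src, dst = edge_index
    # InfoNCE: pull each node toward its neighbours, averaged over edges.
    return -(sim[src, dst] - log_den[src]).mean()
```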
arXiv Detail & Related papers (2022-12-08T23:36:00Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
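One way to picture the alignment: reuse the attention weights that mix token features to mix the per-token label distributions in the same way, so each output token keeps a label matching its actual (possibly CutMix-ed) content. The one-step sketch below is a conceptual rendering; the paper's full procedure also has to track tokens through residuals and the network's other layers.

```python
import torch

def align_token_labels(labels, attn):
    """Propagate per-token labels with the same weights that mix the tokens.

    labels: (B, N, C) per-token label distributions (e.g. mixed by CutMix)
    attn:   (B, N, N) attention weights of the current layer
    """
    labels = attn @ labels      # labels get reweighted exactly like the tokens
    return labels / labels.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```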
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
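One common way to realise an adaptive token count is a cumulative halting score per token, as sketched below; the linear halting head and the threshold value are assumptions made for illustration, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class TokenHalting(nn.Module):
    """Per-token halting (illustrative sketch of adaptive token reduction).

    A tiny head adds to each token's cumulative halting score at every layer;
    once a token's score crosses the threshold it is masked out of all later
    computation, so easy images shed tokens early."""

    def __init__(self, dim=384, threshold=0.99):
        super().__init__()
        self.head = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, x, cum_halt):          # x: (B, N, D); cum_halt: (B, N)
        cum_halt = cum_halt + torch.sigmoid(self.head(x)).squeeze(-1)
        active = cum_halt < self.threshold   # mask of tokens still running
        return active, cum_halt
```

At inference, each block would then be applied only to tokens where `active` is true, analogously to the idling sketch earlier in this list.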
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.