Multi-Scale And Token Mergence: Make Your ViT More Efficient
- URL: http://arxiv.org/abs/2306.04897v2
- Date: Sat, 22 Jul 2023 07:34:08 GMT
- Title: Multi-Scale And Token Mergence: Make Your ViT More Efficient
- Authors: Zhe Bian, Zhe Wang, Wenqiang Han, Kangping Wang
- Abstract summary: Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
- Score: 3.087140219508349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since its inception, Vision Transformer (ViT) has emerged as a prevalent
model in the computer vision domain. Nonetheless, the multi-head self-attention
(MHSA) mechanism in ViT is computationally expensive due to its calculation of
relationships among all tokens. Although some techniques mitigate computational
overhead by discarding tokens, this also results in the loss of potential
information from those tokens. To tackle these issues, we propose a novel token
pruning method that retains information from non-crucial tokens by merging them
with more crucial tokens, thereby mitigating the impact of pruning on model
performance. Crucial and non-crucial tokens are identified by their importance
scores and merged based on similarity scores. Furthermore, multi-scale features
are exploited to represent images, which are fused prior to token pruning to
produce richer feature representations. Importantly, our method can be
seamlessly integrated with various ViTs, enhancing their adaptability.
Experimental evidence substantiates the efficacy of our approach in reducing
the influence of token pruning on model performance. For instance, on the
ImageNet dataset, it achieves a remarkable 33% reduction in computational costs
while only incurring a 0.1% decrease in accuracy on DeiT-S.
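The abstract only sketches the mechanism, so the following is a minimal PyTorch illustration of the core prune-and-merge step, under assumptions not stated above: the importance score is taken as some per-token value (e.g. class-token attention), similarity is cosine similarity between token embeddings, and the multi-scale fusion stage is omitted. It is a sketch of the general idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def prune_and_merge(tokens, importance, keep_ratio=0.7):
    """Merge non-crucial tokens into crucial ones instead of discarding them.

    tokens:     (B, N, C) patch embeddings (class token excluded)
    importance: (B, N) per-token importance scores, e.g. class-token attention
    Returns:    (B, K, C) with K = int(N * keep_ratio)
    """
    B, N, C = tokens.shape
    K = max(1, int(N * keep_ratio))

    # Split tokens into crucial (top-K importance) and non-crucial (the rest).
    idx = importance.argsort(dim=1, descending=True)
    crucial_idx, rest_idx = idx[:, :K], idx[:, K:]
    crucial = torch.gather(tokens, 1, crucial_idx.unsqueeze(-1).expand(-1, -1, C))
    rest = torch.gather(tokens, 1, rest_idx.unsqueeze(-1).expand(-1, -1, C))

    # Cosine similarity between every non-crucial and every crucial token.
    sim = F.normalize(rest, dim=-1) @ F.normalize(crucial, dim=-1).transpose(1, 2)  # (B, N-K, K)
    match = sim.argmax(dim=-1)  # each non-crucial token picks its most similar crucial token

    # Average each crucial token with the non-crucial tokens assigned to it.
    merged = crucial.clone()
    counts = torch.ones(B, K, 1, device=tokens.device, dtype=tokens.dtype)
    merged.scatter_add_(1, match.unsqueeze(-1).expand(-1, -1, C), rest)
    counts.scatter_add_(1, match.unsqueeze(-1), torch.ones_like(match, dtype=tokens.dtype).unsqueeze(-1))
    return merged / counts
```

Because non-crucial tokens are averaged into their most similar crucial token rather than discarded, the reduced sequence still carries some of their information, which is the property the abstract credits for the small accuracy drop.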
Related papers
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
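As a rough illustration of the learnable-meta-token idea (not LeMeViT's actual architecture; the module name, token count, and dimensions below are placeholders), dual cross-attention can be sketched as a few meta tokens attending to the dense image tokens, and the image tokens then attending back to the compact meta tokens:

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Sketch of dual cross-attention between a small set of learnable meta
    tokens and the dense image tokens (sizes are illustrative only)."""

    def __init__(self, dim=384, num_heads=6, num_meta=16):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)
        self.meta_from_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens):                      # (B, N, C)
        meta = self.meta.expand(image_tokens.size(0), -1, -1)
        # Meta tokens summarize the dense tokens ...
        meta, _ = self.meta_from_image(meta, image_tokens, image_tokens)
        # ... and the dense tokens then read back from the compact meta tokens.
        image_tokens, _ = self.image_from_meta(image_tokens, meta, meta)
        return image_tokens, meta
```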
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
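A hedged sketch of graph-style propagation, assuming the graph edges come from cosine similarity and each removed token simply redistributes its features to its most similar kept neighbors (GTP's actual graph construction and propagation rule may differ):

```python
import torch
import torch.nn.functional as F

def propagate_removed_tokens(kept, removed, num_neighbors=2, alpha=0.5):
    """Each removed token distributes its features to its most similar kept
    tokens, with edge weights from cosine similarity.
    kept: (B, K, C), removed: (B, R, C)"""
    sim = F.normalize(removed, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)  # (B, R, K)
    w, nbr = sim.topk(num_neighbors, dim=-1)              # neighbors of each removed token
    w = torch.softmax(w, dim=-1)                          # normalized edge weights

    B, K, C = kept.shape
    update = torch.zeros_like(kept)
    for j in range(num_neighbors):
        idx = nbr[..., j].unsqueeze(-1).expand(-1, -1, C)           # (B, R, C)
        update.scatter_add_(1, idx, removed * w[..., j].unsqueeze(-1))
    return kept + alpha * update
```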
arXiv Detail & Related papers (2023-11-06T11:14:19Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
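One way to picture adaptive resolution, as a sketch only: tokens in important regions stay at full resolution while the remaining tokens are averaged into coarser tokens. For brevity the grouping here follows importance order, whereas a real implementation would group spatially or by clustering; keep_ratio and pool_size are placeholders.

```python
import torch

def adaptive_resolution(tokens, importance, keep_ratio=0.5, pool_size=4):
    """Keep one token per patch in important regions; average the remaining
    tokens in groups of pool_size to form coarser tokens."""
    B, N, C = tokens.shape
    K = max(1, int(N * keep_ratio))
    order = importance.argsort(dim=1, descending=True)
    fine = torch.gather(tokens, 1, order[:, :K, None].expand(-1, -1, C))
    coarse_src = torch.gather(tokens, 1, order[:, K:, None].expand(-1, -1, C))

    # Pad so the low-importance tokens divide evenly into groups of pool_size.
    R = coarse_src.size(1)
    pad = (-R) % pool_size
    if pad:
        coarse_src = torch.cat([coarse_src, coarse_src[:, -1:].expand(-1, pad, -1)], dim=1)
    coarse = coarse_src.view(B, -1, pool_size, C).mean(dim=2)   # (B, ceil(R/pool_size), C)
    return torch.cat([fine, coarse], dim=1)
```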
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
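A minimal sketch of token idling, assuming a per-layer importance score is available: inactive tokens bypass the block via an identity path instead of being removed, so later layers can still use them (the paper's selection and re-activation rules are not reproduced here).

```python
import torch

def idle_block(tokens, importance, block, active_ratio=0.7):
    """Only the highest-scoring tokens pass through the transformer block this
    layer; the rest are carried through unchanged and stay available later."""
    B, N, C = tokens.shape
    K = max(1, int(N * active_ratio))
    active_idx = importance.topk(K, dim=1).indices                   # (B, K)
    gather_idx = active_idx.unsqueeze(-1).expand(-1, -1, C)

    active = torch.gather(tokens, 1, gather_idx)
    active = block(active)                                           # e.g. a ViT encoder block

    out = tokens.clone()                                             # idle tokens = identity
    out.scatter_(1, gather_idx, active)
    return out
```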
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers [2.0442992958844517]
We propose a novel Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency.
TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-based fusing steps.
Our method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%.
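The matching-and-fusing step described above might look roughly as follows, with cosine similarity standing in for the paper's similarity measure and a simple weighted average as the fusing rule:

```python
import torch
import torch.nn.functional as F

def prune_and_squeeze(reserved, pruned):
    """Every pruned token is matched to its single nearest reserved token
    (one direction only) and fused into it with a similarity-derived weight.
    reserved: (B, K, C), pruned: (B, P, C)"""
    sim = F.normalize(pruned, dim=-1) @ F.normalize(reserved, dim=-1).transpose(1, 2)
    weight, host = sim.max(dim=-1)                                   # (B, P)
    weight = weight.clamp(min=0).unsqueeze(-1)                       # ignore dissimilar matches

    B, K, C = reserved.shape
    fused = reserved.clone()
    denom = torch.ones(B, K, 1, device=reserved.device, dtype=reserved.dtype)
    idx = host.unsqueeze(-1).expand(-1, -1, C)
    fused.scatter_add_(1, idx, pruned * weight)
    denom.scatter_add_(1, host.unsqueeze(-1), weight)
    return fused / denom
```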
arXiv Detail & Related papers (2023-04-21T02:59:30Z)
- Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
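A sketch of the local-neighborhood idea behind TAP, assuming the tokens lie on a regular H x W grid and using plain average pooling as the neighborhood operator (the actual TAP module is more involved):

```python
import torch
import torch.nn.functional as F

def token_aware_average_pool(tokens, grid_hw, kernel_size=3, alpha=0.5):
    """Blend each token with the average of its spatial neighbors before
    attention, so a single corrupted patch cannot dominate.
    tokens: (B, H*W, C), grid_hw: (H, W)"""
    B, N, C = tokens.shape
    H, W = grid_hw
    x = tokens.transpose(1, 2).reshape(B, C, H, W)
    neighborhood = F.avg_pool2d(x, kernel_size, stride=1, padding=kernel_size // 2)
    x = (1 - alpha) * x + alpha * neighborhood
    return x.reshape(B, C, N).transpose(1, 2)
```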
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
- Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers [32.972945618608726]
Vision transformers have achieved significant improvements on various vision tasks, but their quadratic interactions between tokens significantly reduce computational efficiency.
We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
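Diversity can be made concrete with a greedy, farthest-point-style selection over the less-important tokens, sketched below with cosine similarity; the paper's exact decoupling and merging rules are not reproduced, and num_keep is assumed to be at most the number of candidate tokens.

```python
import torch
import torch.nn.functional as F

def select_diverse(tokens, num_keep):
    """Greedily pick tokens that are least similar to the ones already kept
    (farthest-point style, cosine distance). tokens: (B, R, C)"""
    x = F.normalize(tokens, dim=-1)
    B, R, C = x.shape
    picked = [torch.zeros(B, dtype=torch.long, device=x.device)]   # start from token 0
    max_sim = (x @ x[:, 0:1].transpose(1, 2)).squeeze(-1)          # similarity to picked set
    for _ in range(num_keep - 1):
        nxt = max_sim.argmin(dim=1)                                # most dissimilar so far
        picked.append(nxt)
        new_sim = (x @ x.gather(1, nxt[:, None, None].expand(-1, 1, C)).transpose(1, 2)).squeeze(-1)
        max_sim = torch.maximum(max_sim, new_sim)
    idx = torch.stack(picked, dim=1)                               # (B, num_keep)
    return torch.gather(tokens, 1, idx[..., None].expand(-1, -1, C))
```

The diverse representatives selected this way, together with the high-importance tokens, would form the reduced set; leftover tokens could then be merged into their nearest kept token as in the other merging sketches.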
arXiv Detail & Related papers (2022-11-21T09:57:11Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
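The alignment idea reduces to mixing per-token labels with the same weights that mix the tokens; a minimal sketch, using a head-averaged attention map as the mixing matrix (an assumption, not necessarily TL-Align's exact rule):

```python
import torch

def align_token_labels(token_labels, attn):
    """Whenever tokens are mixed by an attention map, mix their soft labels
    with the same weights, so every output token keeps a label that matches
    its actual content.
    token_labels: (B, N, num_classes), attn: (B, N, N) row-stochastic weights"""
    return attn @ token_labels

# Example: after CutMix, each spatial token starts with the label of the image
# region it was cut from; propagating the labels through each layer's
# (head-averaged) attention map yields per-token targets for a token-level loss.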
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
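A sketch of deriving token importance from self-attention itself, with class-token attention as the score and a fixed keep ratio standing in for the paper's learnable, adaptive sparsity:

```python
import torch

def importance_from_attention(attn):
    """How much attention the class token pays to each patch token, averaged
    over heads. attn: (B, heads, N+1, N+1) with index 0 = class token."""
    return attn[:, :, 0, 1:].mean(dim=1)        # (B, N)

def prune_tokens(tokens, attn, keep_ratio=0.7):
    """Keep only the patch tokens the model already attends to most.
    tokens: (B, N, C) patch tokens, class token excluded."""
    B, N, C = tokens.shape
    score = importance_from_attention(attn)
    keep = score.topk(max(1, int(N * keep_ratio)), dim=1).indices
    return torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, C))
```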
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
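Hybrid-scale attention can be approximated, purely as an illustration, by letting different groups of heads attend to keys/values pooled at different rates; the real SSA shares a single attention and splits its heads internally, which this sketch does not reproduce (dimensions and pooling rates are placeholders, and H, W are assumed divisible by the rates).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridScaleAttention(nn.Module):
    """One group of heads attends to coarsely pooled keys/values, another to
    finer ones, so a single layer mixes large- and small-object context."""

    def __init__(self, dim=384, num_heads=6, rates=(2, 4)):
        super().__init__()
        self.rates = rates
        heads_per_group = num_heads // len(rates)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads_per_group, batch_first=True) for _ in rates]
        )
        self.proj = nn.Linear(dim * len(rates), dim)

    def forward(self, x, grid_hw):                      # x: (B, H*W, C)
        B, N, C = x.shape
        H, W = grid_hw
        outs = []
        for rate, attn in zip(self.rates, self.attn):
            kv = x.transpose(1, 2).reshape(B, C, H, W)
            kv = F.avg_pool2d(kv, rate)                 # merge tokens at this scale
            kv = kv.flatten(2).transpose(1, 2)          # (B, (H/rate)*(W/rate), C)
            out, _ = attn(x, kv, kv)
            outs.append(out)
        return self.proj(torch.cat(outs, dim=-1))
```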
arXiv Detail & Related papers (2021-11-30T08:08:47Z)