Related papers: Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

URL: http://arxiv.org/abs/2211.11315v1
Date: Mon, 21 Nov 2022 09:57:11 GMT
Title: Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Authors: Sifan Long and Zhen Zhao and Jimin Pi and Shengsheng Wang and Jingdong Wang
Abstract summary: Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
Score: 32.972945618608726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. Many pruning methods have been proposed to remove redundant tokens for efficient vision transformers recently. However, existing studies mainly focus on the token importance to preserve local attentive tokens but completely ignore the global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize the token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces the FLOPs by 35% with only a 0.2% accuracy drop. Notably, benefiting from maintaining the token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.

Related papers

ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism. We propose integrating two strategies: token pruning and token combining. Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed. By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7 times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain. We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens. Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation. Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens. Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers [34.19166698049552]
Vision Transformers (ViT) have shown their competitive advantages performance-wise compared to convolutional neural networks (CNNs) We propose a novel approach to learn instance-dependent attention patterns, by devising a lightweight connectivity predictor module. We show that our method reduces 48% to 69% FLOPs of MHSA while the accuracy drop is within 0.4%.
arXiv Detail & Related papers (2023-03-24T02:12:28Z)
Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions. We propose two techniques to make attention more stable through two general techniques. First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs) We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost. Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input. By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%37% FLOPs and improves the throughput by over 40%. DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies. vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies. Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
arXiv Detail & Related papers (2020-10-09T05:55:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.