Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
- URL: http://arxiv.org/abs/2502.11494v1
- Date: Mon, 17 Feb 2025 06:56:28 GMT
- Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
- Authors: Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
- Abstract summary: We show that importance is not an ideal indicator to decide whether a token should be pruned.
We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens.
Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance.
- Score: 18.928285521147057
- Abstract: Vision tokens in multimodal large language models often dominate the computational overhead due to their excessive length compared to the linguistic modality. Abundant recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in inferior performance compared to random token pruning and leads to incompatibility with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators. Our codes are available at https://github.com/ZichenWen1/DART.
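To make the selection rule concrete, here is a minimal sketch of duplication-based pruning. It is not the released DART implementation (see the repository above for that); it assumes cosine similarity as the duplication measure and evenly spaced pivot tokens, both of which are illustrative choices.

```python
import torch
import torch.nn.functional as F

def duplication_aware_prune(vision_tokens: torch.Tensor,
                            num_pivots: int = 8,
                            keep_ratio: float = 0.111) -> torch.Tensor:
    """Return indices of vision tokens that are least duplicated w.r.t. a
    small pivot set (hypothetical helper, not the official DART code)."""
    n, _ = vision_tokens.shape
    # Illustrative pivot choice: evenly spaced tokens; the paper's pivot
    # selection strategy may differ.
    pivot_idx = torch.linspace(0, n - 1, steps=num_pivots).long()
    pivots = vision_tokens[pivot_idx]                               # (P, D)

    # Duplication score of each token: its highest cosine similarity to any pivot.
    sims = F.cosine_similarity(vision_tokens.unsqueeze(1),
                               pivots.unsqueeze(0), dim=-1)         # (N, P)
    duplication = sims.max(dim=1).values                            # (N,)

    # Retain the least duplicated tokens, always keeping the pivots themselves.
    num_keep = max(num_pivots, int(n * keep_ratio))
    keep_idx = torch.topk(-duplication, k=num_keep).indices
    return torch.unique(torch.cat([keep_idx, pivot_idx]))           # sorted indices


# Example: prune roughly 88.9% of 576 LLaVA-style vision tokens.
tokens = torch.randn(576, 1024)
kept = duplication_aware_prune(tokens, num_pivots=8, keep_ratio=0.111)
print(f"kept {kept.numel()} of {tokens.shape[0]} tokens")
```

In a LLaVA-style pipeline, the retained indices would then be used to slice the vision token sequence before the prefilling stage, which is where the reported speed-ups come from.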
Related papers
- Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem? [19.35502303812707]
Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs.
Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs.
In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
arXiv Detail & Related papers (2025-02-17T07:05:36Z)
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model [45.01871133425388]
We propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token across its whole lifecycle.
MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy.
arXiv Detail & Related papers (2024-11-16T13:45:33Z)
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify the tokens that need to be attended to as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
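As a rough, hedged illustration of the selective-attention idea (not ToSA's actual architecture or training scheme), the sketch below uses a hypothetical linear scorer to pick which tokens go through the attention layer; the rest bypass it via the residual path.

```python
import torch
import torch.nn as nn

class SelectiveAttentionLayer(nn.Module):
    """Toy layer: only 'selected' tokens are refined by self-attention,
    the others skip the layer through the residual connection."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # hypothetical selection head
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
        b, n, d = x.shape                                     # x: (B, N, D)
        scores = self.scorer(x).squeeze(-1)                   # (B, N)
        k = max(1, int(n * keep_ratio))
        sel = scores.topk(k, dim=1).indices                   # (B, k)

        gather_idx = sel.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(x, 1, gather_idx)
        refined, _ = self.attn(selected, selected, selected)

        # Scatter the refined tokens back; unselected tokens pass through unchanged.
        out = x.clone()
        out.scatter_(1, gather_idx, selected + refined)
        return out


layer = SelectiveAttentionLayer(dim=256)
print(layer(torch.randn(2, 196, 256)).shape)  # torch.Size([2, 196, 256])
```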
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
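A hedged sketch of the meta-token idea follows: a handful of learnable meta tokens cross-attend to the dense image tokens, and the dense tokens then read back from that short meta sequence, so both passes avoid the quadratic dense-to-dense cost. This approximates the Dual Cross-Attention described above and is not the LeMeViT code; the dimensions and number of meta tokens are arbitrary.

```python
import torch
import torch.nn as nn

class DualCrossAttentionBlock(nn.Module):
    """Toy dual cross-attention: meta tokens summarize the dense tokens,
    then the dense tokens are updated from the short meta sequence."""

    def __init__(self, dim: int, num_meta: int = 16, num_heads: int = 8):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)
        self.meta_from_dense = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dense_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, dense: torch.Tensor) -> torch.Tensor:
        meta = self.meta.expand(dense.size(0), -1, -1)        # (B, M, D)
        # 1) Meta tokens gather information from all dense tokens: cost O(M*N).
        meta, _ = self.meta_from_dense(meta, dense, dense)
        # 2) Dense tokens read back from the meta tokens: cost O(N*M), not O(N^2).
        update, _ = self.dense_from_meta(dense, meta, meta)
        return dense + update


block = DualCrossAttentionBlock(dim=256, num_meta=16)
print(block(torch.randn(2, 1024, 256)).shape)  # torch.Size([2, 1024, 256])
```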
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z)
- Tree Cross Attention [59.8891512435847]
Tree Cross Attention (TCA) is a module based on Cross Attention that only retrieves information from a logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference.
We show that TCA performs comparably to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient.
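The logarithmic retrieval can be pictured with a toy top-down search over a binary tree of mean-pooled token aggregates. This is only a rough rendering of the idea under stated assumptions (greedy descent by dot-product similarity, power-of-two sequence length), not the TCA algorithm itself.

```python
import torch

def tree_retrieve(query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Greedy top-down retrieval over a balanced binary tree of token
    aggregates; returns O(log N) vectors for the query to attend to."""
    # Build tree levels by mean-pooling pairs (assumes N is a power of two).
    levels = [tokens]                                   # level 0 = leaves, (N, D)
    while levels[-1].shape[0] > 1:
        cur = levels[-1]
        levels.append(cur.reshape(-1, 2, cur.shape[-1]).mean(dim=1))

    # Descend from the root: at each level keep the child most similar to the
    # query and collect its sibling's summary, so only O(log N) nodes are kept.
    retrieved, idx = [], 0
    for level in reversed(levels[:-1]):                 # from below the root to the leaves
        left, right = 2 * idx, 2 * idx + 1
        sims = level[[left, right]] @ query
        best = int(sims.argmax())
        retrieved.append(level[left + 1 - best])        # sibling summary
        idx = left + best
    retrieved.append(levels[0][idx])                    # the chosen leaf itself
    return torch.stack(retrieved)                       # (log2(N) + 1, D)


toks, q = torch.randn(256, 64), torch.randn(64)
print(tree_retrieve(q, toks).shape)  # torch.Size([9, 64])
```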
arXiv Detail & Related papers (2023-09-29T16:50:23Z)
- Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation [89.88214896713846]
The STA score considers two critical factors: temporal redundancy and semantic importance.
We apply the STA module to off-the-shelf video Transformers and VideoSwins.
Results on Kinetics-400 and Something-Something V2 show an overall computation reduction of about 30% with a negligible 0.2% accuracy drop.
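Purely as a hypothetical illustration of combining those two factors (the paper's actual accumulation rule differs), one could score each patch token by class-token attention for semantic importance and by similarity to the co-located token in the previous frame for temporal redundancy:

```python
import torch
import torch.nn.functional as F

def redundancy_importance_scores(tokens: torch.Tensor,
                                 prev_tokens: torch.Tensor,
                                 cls_attn: torch.Tensor,
                                 alpha: float = 0.5) -> torch.Tensor:
    """tokens, prev_tokens: (N, D) patch tokens of the current / previous frame.
    cls_attn: (N,) attention the class token pays to each patch.
    Higher score = more worth keeping (important and not temporally redundant)."""
    redundancy = F.cosine_similarity(tokens, prev_tokens, dim=-1)   # (N,)
    importance = cls_attn / cls_attn.sum()
    return alpha * importance - (1.0 - alpha) * redundancy


# Keep the top 70% of tokens in the current frame.
cur, prev = torch.randn(196, 384), torch.randn(196, 384)
attn = torch.rand(196)
keep = redundancy_importance_scores(cur, prev, attn).topk(int(196 * 0.7)).indices
print(keep.shape)  # torch.Size([137])
```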
arXiv Detail & Related papers (2023-08-08T19:38:15Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only a 0.2% drop in top-1 accuracy.
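As a hedged sketch of attention-driven token pruning in a ViT (not the exact framework proposed here), the snippet below keeps the patch tokens that receive the most attention from the [CLS] token; the keep ratio and the scoring rule are assumptions.

```python
import torch

def prune_by_cls_attention(tokens: torch.Tensor,
                           attn: torch.Tensor,
                           keep_ratio: float = 0.7) -> torch.Tensor:
    """tokens: (B, 1+N, D) with the [CLS] token first.
    attn:   (B, H, 1+N, 1+N) attention weights from the current layer.
    Returns (B, 1+K, D): the [CLS] token plus the K most-attended patches."""
    b, n_plus_1, d = tokens.shape
    n = n_plus_1 - 1

    # Score each patch by the average attention it receives from [CLS].
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)          # (B, N)
    k = max(1, int(n * keep_ratio))
    keep = cls_attn.topk(k, dim=1).indices + 1        # shift past [CLS]

    patches = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([tokens[:, :1], patches], dim=1)


tokens = torch.randn(2, 197, 384)                     # DeiT-S-like shapes
attn = torch.softmax(torch.randn(2, 6, 197, 197), dim=-1)
print(prune_by_cls_attention(tokens, attn, 0.7).shape)  # torch.Size([2, 138, 384])
```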
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- Token-level Adaptive Training for Neural Machine Translation [84.69646428587548]
There exists a token imbalance phenomenon in natural language as different tokens appear with different frequencies.
The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies.
Low-frequency tokens may carry critical semantic information that will affect the translation quality once they are neglected.
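A hedged sketch of the general recipe: re-weight the per-token cross-entropy by a monotone function of each target token's corpus frequency, so that rare tokens contribute more to the objective. The specific weighting functions proposed in the paper differ; the inverse-log-frequency weight below is only illustrative.

```python
import math
import torch
import torch.nn.functional as F

def frequency_weighted_nll(logits: torch.Tensor,
                           targets: torch.Tensor,
                           token_freq: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """logits: (B, T, V), targets: (B, T), token_freq: (V,) corpus counts.
    Down-weights frequent target tokens and up-weights rare ones."""
    # Illustrative weight: inverse log-frequency, normalized to mean 1.
    weights = 1.0 / torch.log(token_freq.float() + math.e) ** temperature
    weights = weights / weights.mean()

    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                targets.reshape(-1), reduction="none")  # (B*T,)
    return (per_token * weights[targets.reshape(-1)]).mean()


logits = torch.randn(4, 10, 32000)
targets = torch.randint(0, 32000, (4, 10))
freq = torch.randint(1, 10_000, (32000,))
print(frequency_weighted_nll(logits, targets, freq))
```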
arXiv Detail & Related papers (2020-10-09T05:55:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.