Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
- URL: http://arxiv.org/abs/2411.10803v1
- Date: Sat, 16 Nov 2024 13:45:33 GMT
- Title: Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
- Authors: Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang
- Abstract summary: We propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token across its whole lifecycle.
MustDrop reduces FLOPs on LLaVA by about 88.5% with a compression ratio of 92.2% while maintaining comparable accuracy.
- Score: 45.01871133425388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms inference efficiency. To address this, several recent works drop unimportant tokens during inference, where the importance of each token is decided only by information from either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token across its whole lifecycle, including the vision encoding, prefilling, and decoding stages. Concretely, in the vision encoding stage, MustDrop merges spatially adjacent tokens with high similarity and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage, MustDrop further compresses vision tokens under the guidance of text semantics with a dual-attention filtering strategy. In the decoding stage, an output-aware cache policy further reduces the size of the KV cache. By leveraging tailored strategies across these stages, MustDrop more precisely recognizes important and redundant tokens, achieving an optimal balance between performance and efficiency. For instance, MustDrop reduces FLOPs on LLaVA by about 88.5% with a compression ratio of 92.2% while maintaining comparable accuracy. Our code is available at https://github.com/liuting20/MustDrop.
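The abstract names three stages; the first, merging spatially adjacent vision tokens with high similarity, is concrete enough to sketch. Below is a minimal illustration under stated assumptions (a 1×2 merge window, cosine similarity, a 0.9 threshold, and helper names of our own), not the released MustDrop implementation:

```python
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(tokens: torch.Tensor, grid_h: int, grid_w: int,
                          threshold: float = 0.9) -> torch.Tensor:
    """Merge horizontally adjacent vision tokens whose cosine similarity
    exceeds `threshold`; a toy stand-in for MustDrop's stage-1 spatial
    merging (the real window shape and criterion may differ).

    tokens: (grid_h * grid_w, dim) vision-encoder outputs.
    """
    grid = tokens.view(grid_h, grid_w, -1)
    out = []
    for row in grid:                          # scan the token grid row by row
        col = 0
        while col < grid_w:
            if col + 1 < grid_w:
                sim = F.cosine_similarity(row[col], row[col + 1], dim=0)
                if sim > threshold:           # near-duplicate neighbours: average them
                    out.append((row[col] + row[col + 1]) / 2)
                    col += 2
                    continue
            out.append(row[col])              # otherwise keep the token unchanged
            col += 1
    return torch.stack(out)

# Example: a 24x24 grid of 1024-dim tokens, matching LLaVA's CLIP-ViT-L/14 encoder.
vision_tokens = torch.randn(24 * 24, 1024)
print(merge_adjacent_tokens(vision_tokens, 24, 24).shape)
```

The paper also establishes a key token set at this stage to protect vision-critical tokens from later pruning; that bookkeeping is omitted from the sketch.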
Related papers
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning method with inference-Time Optimization for fast and low-memory VLMs.
Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
- Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models [50.214593234229255]
We introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens.
We propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence.
Experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
- Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More [18.928285521147057]
We show that importance is not an ideal indicator to decide whether a token should be pruned.
We propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens.
Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance.
arXiv Detail & Related papers (2025-02-17T06:56:28Z)
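A minimal sketch of duplication-based pruning in this spirit: a token's removal priority is its maximum similarity to any other token, so near-duplicates go first. The one-shot greedy scoring, cosine metric, and function name are our simplifications, not DART's exact procedure:

```python
import torch
import torch.nn.functional as F

def drop_duplicates(tokens: torch.Tensor, keep_ratio: float = 0.111) -> torch.Tensor:
    """Duplication-aware token reduction: score each token by its maximum
    cosine similarity to any *other* token, then keep the least duplicated
    fraction (default ~11%, matching the 88.9% pruning rate above)."""
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.T
    sim.fill_diagonal_(-1.0)                         # ignore self-similarity
    duplication = sim.max(dim=-1).values             # how redundant each token is
    n_keep = max(1, int(tokens.size(0) * keep_ratio))
    keep = torch.topk(-duplication, n_keep).indices  # least duplicated tokens
    return tokens[keep.sort().values]                # preserve original order

tokens = torch.randn(576, 4096)                      # e.g. LLaVA's 576 vision tokens
print(drop_duplicates(tokens).shape)
```

- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]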
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
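Training-free, attention-guided pruning of this kind reduces to a top-k selection; in the sketch below, using the encoder's last-layer [CLS]-to-patch attention as the significance score is an assumption about the general recipe, not FoPru's exact cost:

```python
import torch

def focal_prune(patch_tokens: torch.Tensor, cls_attn: torch.Tensor,
                keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the patch tokens that receive the most [CLS] attention.

    patch_tokens: (n, dim) tokens from the vision encoder.
    cls_attn:     (n,) [CLS]-to-patch attention, averaged over heads.
    """
    n_keep = max(1, int(patch_tokens.size(0) * keep_ratio))
    keep = torch.topk(cls_attn, n_keep).indices
    return patch_tokens[keep.sort().values]   # keep original spatial order

patches = torch.randn(576, 1024)
cls_attn = torch.rand(576).softmax(dim=0)     # stand-in for real attention weights
print(focal_prune(patches, cls_attn).shape)   # (144, 1024)
```

- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]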
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.
We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify tokens that need to be attended as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
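A toy version of token-selective attention: a lightweight scorer routes a subset of tokens through self-attention while the rest bypass the layer untouched. The linear scoring head, hard top-k routing, and fixed keep ratio are illustrative choices, not ToSA's design:

```python
import torch
import torch.nn as nn

class SelectiveAttentionLayer(nn.Module):
    """Self-attention over a selected token subset; unselected tokens skip."""
    def __init__(self, dim: int, heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)         # per-token selection score
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        b, n, d = x.shape
        n_keep = max(1, int(n * self.keep_ratio))
        idx = self.scorer(x).squeeze(-1).topk(n_keep, dim=-1).indices
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
        sel = torch.gather(x, 1, gather_idx)    # selected tokens only
        attended, _ = self.attn(sel, sel, sel)  # attention over the subset
        out = x.clone()                         # the rest pass through unchanged
        out.scatter_(1, gather_idx, attended)
        return out

layer = SelectiveAttentionLayer(dim=256)
print(layer(torch.randn(2, 196, 256)).shape)    # (2, 196, 256)
```

- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]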
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant 1.7× speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
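The withdrawal mechanic is simple to sketch: after a chosen layer, all visual tokens are dropped from the hidden states so the remaining layers process text tokens only. Layer internals, the [visual | text] layout, and the withdrawal depth below are simplifying assumptions:

```python
import torch
import torch.nn as nn

def forward_with_vtw(layers: nn.ModuleList, hidden: torch.Tensor,
                     n_visual: int, withdraw_at: int) -> torch.Tensor:
    """Run decoder layers, withdrawing visual tokens after `withdraw_at` layers.
    Assumes the sequence is laid out as [visual tokens | text tokens]."""
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            hidden = hidden[:, n_visual:, :]  # drop vision tokens; later layers
                                              # see only the text tokens
        hidden = layer(hidden)
    return hidden

# Toy stand-in: 8 linear "layers", 576 visual + 32 text tokens.
layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
h = torch.randn(1, 576 + 32, 64)
print(forward_with_vtw(layers, h, n_visual=576, withdraw_at=4).shape)  # (1, 32, 64)
```

- Object Recognition as Next Token Prediction [99.40793702627396]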
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z)
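The cache policy behind attention sinks is compact enough to sketch: always retain the first few tokens (the "sinks") plus a sliding window of recent tokens, evicting everything in between. The sink count and window size below are illustrative defaults, not the paper's tuned values:

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             n_sink: int = 4, window: int = 1024):
    """Attention-sink cache eviction: keep the first `n_sink` tokens plus the
    most recent `window` tokens. Shapes are (seq, heads, head_dim)."""
    seq = keys.size(0)
    if seq <= n_sink + window:
        return keys, values                    # nothing to evict yet
    keep = torch.cat([torch.arange(n_sink),                 # the attention sinks
                      torch.arange(seq - window, seq)])     # the recent window
    return keys[keep], values[keep]

k, v = torch.randn(5000, 32, 128), torch.randn(5000, 32, 128)
k2, v2 = evict_kv(k, v)
print(k2.shape)                                # (1028, 32, 128)
```

- Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers [32.972945618608726]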
Vision transformers have achieved significant improvements on various vision tasks, but the quadratic interactions between tokens significantly reduce computational efficiency.
We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
arXiv Detail & Related papers (2022-11-21T09:57:11Z)
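A simplified sketch of decoupling and merging: tokens are split by importance, and each inattentive token is folded into its most similar kept token instead of being discarded, so diverse information survives pruning. The mean-merge rule and score source are assumptions, not the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def decouple_and_merge(tokens: torch.Tensor, importance: torch.Tensor,
                       keep_ratio: float = 0.6) -> torch.Tensor:
    """Keep the most important tokens; merge each remaining token into its
    most similar kept token by averaging, rather than dropping it."""
    n_keep = max(1, int(tokens.size(0) * keep_ratio))
    order = importance.argsort(descending=True)
    kept, dropped = tokens[order[:n_keep]], tokens[order[n_keep:]]
    if dropped.numel():
        sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
        target = sim.argmax(dim=-1)                  # most similar kept token
        kept = kept.index_add(0, target, dropped)    # fold dropped tokens in
        counts = torch.ones(n_keep).index_add(0, target,
                                              torch.ones(len(dropped)))
        kept = kept / counts.unsqueeze(-1)           # average each merged group
    return kept

tokens = torch.randn(196, 384)                       # e.g. DeiT-T patch tokens
importance = torch.rand(196)                         # stand-in for attention scores
print(decouple_and_merge(tokens, importance).shape)  # (117, 384)
```

- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]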
We propose an adaptive sparse token pruning framework with minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)