SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
- URL: http://arxiv.org/abs/2510.24214v1
- Date: Tue, 28 Oct 2025 09:29:37 GMT
- Title: SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
- Authors: Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He
- Abstract summary: We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
- Score: 59.415473779171315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, which leaves the selected set semantically incomplete. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), which jointly models both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage measure for a given set of selected tokens, computed from the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we obtain our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
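The iterative selection described in the abstract can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' released implementation: the cosine-similarity definition of coverage, the additive combination of saliency and coverage gain, and the `alpha` weight are all assumptions.

```python
import numpy as np

def scope_select(features, saliency, k, alpha=1.0):
    """Greedy SCOPE-style token selection (sketch).

    At each step, pick the token whose combined score (saliency plus
    token-coverage gain over the already-selected set) is highest.
    Coverage of token j is assumed to be its maximum cosine similarity
    to any selected token; set-coverage is the sum over all tokens.
    """
    n = features.shape[0]
    # Cosine-similarity matrix between all visual tokens.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    covered = np.zeros(n)          # per-token coverage achieved so far
    selected = []
    remaining = set(range(n))
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in remaining:
            # Token-coverage gain: extra set-coverage from adding token i.
            gain = np.maximum(sim[i], covered).sum() - covered.sum()
            score = alpha * saliency[i] + gain   # assumed combination
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        covered = np.maximum(covered, sim[best])
        remaining.remove(best)
    return selected
```

With `alpha=0` this reduces to a pure coverage maximizer; with a very large `alpha` it reduces to plain saliency top-k, which matches the paper's framing of SCOPE as interpolating between the two criteria.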
Related papers
- When LLaVA Meets Objects: Token Composition for Vision-Language-Models [31.554057603168214]
Mask-LLaVA is a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive Vision Language Models. While all tokens are used during training, the resulting model can flexibly drop tokens at test time, especially the mask-based object tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for strong performance.
arXiv Detail & Related papers (2026-02-04T18:50:46Z) - TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models [4.779482139419908]
We introduce a mutual information-based token pruning strategy that removes visual tokens that are semantically redundant given the textual tokens. Our method maintains strong performance while reducing visual tokens by 88.9% on models such as LLaVA-1.5-7B.
arXiv Detail & Related papers (2025-08-30T02:43:50Z) - VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference [76.00113788838334]
Group-wise VIsual token Selection and Aggregation (VISA). Our method can preserve more visual information while compressing visual tokens. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA.
arXiv Detail & Related papers (2025-08-25T10:07:07Z) - Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs [30.97955016203357]
In multimodal large language models, the length of input visual tokens is often significantly greater than that of their textual counterparts. We propose a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. Our experiments show that CDPruner establishes a new state of the art on various vision-based benchmarks.
arXiv Detail & Related papers (2025-06-12T17:59:09Z) - Window Token Concatenation for Efficient Visual Large Language Models [59.6094005814282]
We propose Window Token Concatenation (WiCo) to reduce visual tokens in Visual Large Language Models (VLLMs). Naive concatenation groups diverse tokens into one, which may obscure some fine details. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors.
arXiv Detail & Related papers (2025-04-05T02:32:58Z) - VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [35.38573641029626]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On this task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models [13.519389777060226]
Adding visual tokens to Large Multimodal Models (LMMs) increases the total token count, often by thousands. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity among the selected tokens.
arXiv Detail & Related papers (2025-03-04T01:33:14Z) - Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z) - ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens as needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. The main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
arXiv Detail & Related papers (2024-09-02T10:49:10Z) - Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z) - Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). The proposed EPIC method is easily combined with existing pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.