IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
- URL: http://arxiv.org/abs/2602.13315v1
- Date: Tue, 10 Feb 2026 11:20:24 GMT
- Title: IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
- Authors: Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu, Jianchen Zhu, Yangdong Deng,
- Abstract summary: Visual token pruning has emerged as a critical technique for accelerating MLLM inference. IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks.
- Score: 11.254129271889035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.
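The abstract names Maximal Marginal Relevance (MMR) as the mechanism for balancing importance against diversity but does not spell out the procedure. The sketch below shows the generic greedy MMR selection rule applied to token pruning; it is not the authors' implementation. The function name `mmr_select`, the trade-off weight `lam`, the toy feature vectors, and the importance scores are all hypothetical, and how IDPruner actually derives importance without attention maps is not stated here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(features, importance, keep, lam=0.5):
    """Greedy MMR selection (generic formulation, assumed for illustration).

    At each step, pick the unselected token i that maximizes
        lam * importance[i] - (1 - lam) * max_sim[i],
    where max_sim[i] is the highest cosine similarity between token i and
    any token already selected. Ties go to the lowest index.
    """
    n = len(features)
    keep = min(keep, n)
    selected = []
    chosen = [False] * n
    # Starts at 0.0, so negative similarities never act as a bonus.
    max_sim = [0.0] * n
    while len(selected) < keep:
        candidates = [i for i in range(n) if not chosen[i]]
        best = max(candidates,
                   key=lambda i: lam * importance[i] - (1 - lam) * max_sim[i])
        selected.append(best)
        chosen[best] = True
        # Update each remaining token's similarity to the selected set.
        for i in candidates:
            if i != best:
                sim = cosine(features[i], features[best])
                max_sim[i] = max(max_sim[i], sim)
    return selected


# Toy data: tokens 0 and 1 are near-duplicates, as are tokens 2 and 3.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
importance = [0.9, 0.85, 0.5, 0.4]

print(mmr_select(features, importance, keep=2, lam=0.5))  # [0, 2]
print(mmr_select(features, importance, keep=2, lam=1.0))  # [0, 1]
```

With `lam=1.0` the rule degenerates to top-k by importance and keeps both near-duplicate tokens; with `lam=0.5` the redundancy penalty pushes the second pick to a dissimilar token, which is the trade-off the abstract describes.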
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z)
- D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z)
- FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning [16.753299634529736]
Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective.
arXiv Detail & Related papers (2025-11-22T02:25:00Z)
- $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs [26.779915891040236]
We propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. Our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
arXiv Detail & Related papers (2025-10-20T06:40:17Z)
- MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs [67.75865317787708]
MMG-Vid is a training-free visual token pruning framework for video understanding. We show that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens.
arXiv Detail & Related papers (2025-08-28T17:50:03Z)
- IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
The IAM framework achieves dual benefits of accelerated attention computation and reduced KV cache usage. We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z)
- PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models [32.33892531885448]
Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. We introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance.
arXiv Detail & Related papers (2024-10-09T07:13:22Z)
- Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. The main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
- Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic
arXiv Detail & Related papers (2022-06-21T17:40:55Z)