Related papers: D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

URL: http://arxiv.org/abs/2512.19443v2
Date: Fri, 26 Dec 2025 04:08:59 GMT
Title: D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
Authors: Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang
Abstract summary: D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
Score: 49.16227597771663
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods.
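
The two-stage selection described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_mis_prune`, the parameters `sim_threshold` and `radius`, the rule that an edge requires both spatial proximity and semantic similarity, and the use of precomputed importance scores in place of the paper's debiased attention score are all hypothetical choices made for the example.

```python
import numpy as np


def greedy_mis_prune(importance, features, positions, num_pivots, budget,
                     sim_threshold=0.8, radius=1.0):
    """Keep `num_pivots` top-scoring tokens, then add tokens greedily so that
    no two added tokens are neighbors in the hybrid graph (an independent set).

    All thresholds and the edge rule are assumptions for illustration only.
    """
    importance = np.asarray(importance, dtype=float)
    features = np.asarray(features, dtype=float)    # assumed non-zero rows
    positions = np.asarray(positions, dtype=float)
    n = importance.shape[0]

    # Rank tokens from most to least important (stable for ties).
    order = np.argsort(-importance, kind="stable")
    pivots = order[:num_pivots].tolist()
    rest = order[num_pivots:]

    # Hybrid graph (assumed rule): connect two tokens when they are both
    # semantically similar (cosine similarity) and spatially close.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    distance = np.linalg.norm(
        positions[:, None, :] - positions[None, :, :], axis=-1)
    adjacent = (similarity >= sim_threshold) & (distance <= radius)
    np.fill_diagonal(adjacent, False)

    # Greedy independent-set pass over the non-pivot tokens: take the most
    # important token still available, then drop its neighbors.
    available = np.zeros(n, dtype=bool)
    available[rest] = True
    selected = list(pivots)
    for i in rest:
        if len(selected) >= budget:
            break
        if available[i]:
            selected.append(int(i))
            available[adjacent[i]] = False
            available[i] = False
    return selected


# Toy input: five tokens on a line; tokens 1 and 2 share the same feature.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
feats = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
coords = [[0, 0], [0, 1], [0, 2], [0, 3], [0, 4]]
kept = greedy_mis_prune(scores, feats, coords, num_pivots=1, budget=3)
print(kept)
```

In the toy input, token 2 is a close, near-duplicate neighbor of token 1, so once token 1 is kept, token 2 is skipped in favor of the less important but non-redundant token 3. Stopping at the budget means the result is an independent set that need not be maximal; a full MIS pass would continue until no token remains available.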

Related papers

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs [11.254129271889035]
Visual token pruning has emerged as a critical technique for accelerating MLLM inference. IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks.
arXiv Detail & Related papers (2026-02-10T11:20:24Z)
CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
Segment-Level Attribution for Selective Learning of Long Reasoning Traces [39.93489058702076]
We propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency. Our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces.
arXiv Detail & Related papers (2026-01-31T00:29:24Z)
SEER: Spectral Entropy Encoding of Roles for Context-Aware Attention-Based Design Pattern Detection [0.0]
This paper presents an upgraded version of our prior method Context Is All You Need for detecting Gang of Four (GoF) design patterns from source code. SEER addresses these limitations with two principled additions: (i) a spectral-entropy role encoder that derives per-member role embeddings from the Laplacian spectrum of each class's interaction graph, and (ii) a time-weighted calling context that assigns empirically calibrated duration priors to method categories. We evaluate SEER on PyDesignNet (1,832 files, 35,000 sequences, 23 GoF patterns) and observe consistent gains over our previous system.
arXiv Detail & Related papers (2026-01-19T19:13:40Z)
Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z)
Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z)
TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models [4.779482139419908]
We introduce a mutual information-based token pruning strategy that removes visual tokens semantically with textual tokens. Our method maintains strong performance while reducing textual tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA--7B.
arXiv Detail & Related papers (2025-08-30T02:43:50Z)
Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training [15.783265191574392]
We introduce ZeroTuning: a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token. We show theoretically that adding lightweight biases to this token's attention logits monotonically controls the entropy of the downstream attention distribution. We present two variants: a supervised mode that calibrates on validation examples, and a novel unsupervised mode that directly minimizes the model's output entropy.
arXiv Detail & Related papers (2025-05-16T22:52:24Z)
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [68.8373788348678]
Visual instruction tuning adapts pre-trained Multimodal Large Language Models to follow human instructions. PRISM is the first training-free framework for efficient visual instruction selection. It reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines.
arXiv Detail & Related papers (2025-02-17T18:43:41Z)
PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models [32.33892531885448]
Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. We introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance.
arXiv Detail & Related papers (2024-10-09T07:13:22Z)
Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
Receptive Multi-granularity Representation for Person Re-Identification [46.99913453669368]
This paper proposes a receptive multi-granularity learning approach to facilitate stripe-based feature learning. With a two-branch network architecture, discriminative identity representations can be learned at different scales. Our approach achieves a state-of-the-art accuracy of 96.2%@Rank-1 or 90.0%@mAP on the challenging Market-1501 benchmark.
arXiv Detail & Related papers (2020-08-31T09:26:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.