$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
- URL: http://arxiv.org/abs/2510.17205v1
- Date: Mon, 20 Oct 2025 06:40:17 GMT
- Title: $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
- Authors: Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen,
- Abstract summary: We propose emphVisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B.<n>Our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
- Score: 26.779915891040236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, \textit{they lack a fundamental understanding of how MLLMs process and fuse multimodal information.} Through systematic analysis, we uncover a \textbf{three-stage} cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose \emph{VisiPruner}, a training-free pruning framework that reduces up to 99\% of vision-related attention computations and 53.9\% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.
Related papers
- ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention [22.397648349603696]
ViCA is a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers.<n>ViCA preserves 98% of baseline accuracy while reducing visual-side to 4%, consistently achieving superior performance-efficiency trade-offs.
arXiv Detail & Related papers (2026-02-07T14:46:05Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks.<n>Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks.<n>Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Sparse Training Scheme for Multimodal LLM [26.81140959413325]
Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains.<n>We propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS)<n>This scheme consists of two key components: the Visual Token, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by skipping unnecessary layers in the language model during both forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:33:20Z) - CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models [75.88232735646018]
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos.<n>Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations.<n>We propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM.
arXiv Detail & Related papers (2025-08-24T07:47:00Z) - Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective [6.258220461022373]
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency.<n>We show that token compression is feasible at the input stage of LLM with negligible performance loss.<n>We propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass.
arXiv Detail & Related papers (2025-06-01T17:44:16Z) - ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers [70.38258823378557]
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens.<n>We introduce a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens.<n>We propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers.
arXiv Detail & Related papers (2025-04-01T07:47:55Z) - ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming [14.937905258757635]
$textbfST3$ is a framework designed to accelerate MLLM inference without retraining.<n>$textbfST3$ can be seamlessly integrated into existing pre-trained MLLMs.
arXiv Detail & Related papers (2024-12-28T10:17:29Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.<n>We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)<n>DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.<n>As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.<n>In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.<n>This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)<n>SeTok groups visual features into semantic units via a dynamic clustering algorithm.<n>The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.<n>VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers.<n>Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.