Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
- URL: http://arxiv.org/abs/2505.22038v2
- Date: Thu, 23 Oct 2025 12:39:42 GMT
- Title: Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
- Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
- Abstract summary: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. Previous approaches have attempted to reduce the number of image tokens through token pruning. We propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens.
- Score: 30.73986620551153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average. Our code is available at https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
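To make the staged scheme concrete, here is a minimal sketch of BTP-style multi-stage pruning. The stage labels, scoring proxies, and keep ratios are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
# Minimal sketch of staged token pruning in the spirit of BTP. Stage
# boundaries, scoring proxies, and keep ratios are illustrative assumptions.
import torch

def btp_style_prune(image_tokens, global_scores, local_scores, stages):
    """image_tokens: (N, D) visual tokens; global_scores/local_scores: (N,)
    per-token proxies for impact on subsequent layers vs. the current
    layer's output; stages: list of (stage_name, keep_ratio) pairs tuned
    on a small calibration set."""
    kept = torch.arange(image_tokens.size(0))
    for stage_name, keep_ratio in stages:
        # Early stages weight global impact (subsequent layers); deeper
        # stages weight local output consistency, as the abstract describes.
        scores = global_scores if stage_name == "early" else local_scores
        k = max(1, int(keep_ratio * kept.numel()))
        top = scores[kept].topk(k).indices
        kept = kept[top].sort().values  # keep positional order intact
    return image_tokens[kept], kept
```

In this reading, the calibration set would be used to sweep the stage boundaries and per-stage keep ratios until the target compression rate (78% in the paper) is reached.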
Related papers
- AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs [29.68162972167947]
We propose an object-level token merging strategy for adaptive token compression. On average, our approach uses only 10% of the tokens while achieving almost 96% of the vanilla model's performance.
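A minimal sketch of the object-level merging idea described above, assuming per-token object labels come from an external segmenter; the mean-pooling rule is an assumption, not AdaTok's exact procedure.

```python
# Minimal sketch of object-level token merging in the spirit of AdaTok.
# The per-token object labels and the mean-pooling rule are assumptions.
import torch

def merge_by_object(tokens, object_ids):
    """tokens: (N, D) visual tokens; object_ids: (N,) integer object label
    per token (e.g., from a segmentation model)."""
    merged = [tokens[object_ids == obj].mean(dim=0)  # one token per object
              for obj in object_ids.unique()]
    return torch.stack(merged)
```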
arXiv Detail & Related papers (2025-11-18T06:12:15Z)
- Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, typically based on text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
- HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score [14.857585045577165]
HIVTP is a training-free method to improve the inference efficiency of Vision-Language Models (VLMs). We propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively.
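A minimal sketch of hierarchical retention under the description above. The window size, global keep ratio, and the use of a middle layer's attention as the importance score are assumptions, not the paper's exact procedure.

```python
# Minimal sketch of hierarchical token retention in the spirit of HIVTP.
# Window size, keep ratio, and the middle-layer importance source are assumptions.
import torch

def hierarchical_keep(importance, grid_h, grid_w, window=4, global_ratio=0.1):
    """importance: (grid_h * grid_w,) per-token importance from a middle layer."""
    imp = importance.view(grid_h, grid_w)
    keep = set()
    # Local pass: retain the most important token inside each spatial window.
    for i in range(0, grid_h, window):
        for j in range(0, grid_w, window):
            patch = imp[i:i + window, j:j + window]
            r, c = divmod(int(patch.argmax()), patch.size(1))
            keep.add((i + r) * grid_w + (j + c))
    # Global pass: additionally retain the globally top-scoring tokens.
    n_global = max(1, int(global_ratio * importance.numel()))
    keep.update(importance.topk(n_global).indices.tolist())
    return torch.tensor(sorted(keep))
```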
arXiv Detail & Related papers (2025-09-28T05:53:39Z)
- Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance [60.028070589466445]
Pyramid Token Pruning (PTP) is a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. We show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
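One plausible way to fuse the two signals named above, sketched minimally; the region gating and the linear weighting are assumptions, not the paper's formulation.

```python
# Minimal sketch of fusing bottom-up saliency with top-down instruction
# relevance, in the spirit of PTP; gating and weighting are assumptions.
import torch

def fuse_importance(saliency, instr_relevance, in_salient_region, alpha=0.5):
    """All inputs are (N,) per-token tensors; in_salient_region is boolean,
    marking tokens inside regions kept by the region-level pass."""
    token_score = alpha * saliency + (1 - alpha) * instr_relevance
    return token_score * in_salient_region.float()  # gate out pruned regions
```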
arXiv Detail & Related papers (2025-09-19T07:28:17Z)
- CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning [15.733788584792388]
We propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal in-context learning (ICL). After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks. It also improves efficiency, achieving an average latency reduction of 10.78%.
arXiv Detail & Related papers (2025-08-11T11:41:51Z)
- A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models [94.49953824684853]
We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
arXiv Detail & Related papers (2025-08-03T02:15:43Z)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models [5.025353943896242]
GreedyPrune is a training-free visual token pruning algorithm designed to optimize semantic saliency and visual diversity. We show that GreedyPrune achieves state-of-the-art accuracy across various multimodal tasks and models while significantly reducing end-to-end inference latency.
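A minimal sketch of greedy selection balancing saliency and diversity, in the spirit of the description above; the MMR-style objective and the `lam` trade-off are assumptions, not GreedyPrune's exact algorithm.

```python
# Minimal sketch of greedy token selection trading off saliency against
# redundancy; the objective and trade-off weight are assumptions.
import torch

def greedy_select(saliency, features, k, lam=0.5):
    """saliency: (N,) per-token scores; features: (N, D) L2-normalized."""
    selected = [int(saliency.argmax())]
    while len(selected) < k:
        sim_to_sel = features @ features[selected].T   # (N, len(selected))
        redundancy = sim_to_sel.max(dim=1).values      # closest kept token
        gain = lam * saliency - (1 - lam) * redundancy
        gain[selected] = float("-inf")                 # never re-pick a token
        selected.append(int(gain.argmax()))
    return torch.tensor(sorted(selected))
```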
arXiv Detail & Related papers (2025-06-16T07:21:11Z)
- ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
- LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation [10.614327633823462]
Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks.
arXiv Detail & Related papers (2025-04-15T04:27:15Z)
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method. It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
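A minimal sketch of NMS-style token suppression combining the three cues named above; the thresholds and the suppression rule are assumptions, not AdaptPrune's adaptive variant.

```python
# Minimal sketch of NMS over visual tokens using attention, spatial
# distance, and feature similarity; thresholds are assumptions.
import torch

def token_nms(attn, positions, features, sim_thresh=0.8, dist_thresh=2.0):
    """attn: (N,) attention scores; positions: (N, 2) grid coordinates;
    features: (N, D) L2-normalized token features."""
    order = attn.argsort(descending=True)  # visit tokens by importance
    kept = []
    for i in order.tolist():
        redundant = False
        for j in kept:
            close = (positions[i] - positions[j]).float().norm() < dist_thresh
            similar = float(features[i] @ features[j]) > sim_thresh
            if close and similar:  # suppress near-duplicate neighbors
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return torch.tensor(sorted(kept))
```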
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens according to attention-based significance scores derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
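A minimal sketch of pruning by vision-encoder significance as described above; using [CLS]-to-patch attention as the score and the keep ratio are assumptions, not FoPru's exact criterion.

```python
# Minimal sketch of top-k pruning by vision-encoder attention significance;
# the [CLS]-attention score and keep ratio are assumptions.
import torch

def prune_by_significance(patch_tokens, cls_attn, keep_ratio=0.25):
    """patch_tokens: (N, D); cls_attn: (N,) attention from the [CLS] token."""
    k = max(1, int(keep_ratio * patch_tokens.size(0)))
    keep = cls_attn.topk(k).indices.sort().values  # preserve spatial order
    return patch_tokens[keep], keep
```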
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. Their main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
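A minimal sketch of locating that inflection point via a discrete second difference over the sorted similarity curve; this detection rule is an assumption, not the paper's exact criterion.

```python
# Minimal sketch of cutting tokens at the inflection point of the sorted
# CLS-similarity curve; the second-difference rule is an assumption.
import torch

def keep_before_inflection(cls_sim):
    """cls_sim: (N,) similarity of each visual token to the CLS token, N >= 3."""
    order = cls_sim.argsort(descending=True)
    curve = cls_sim[order]                          # monotonically decreasing
    second_diff = curve[2:] - 2 * curve[1:-1] + curve[:-2]
    cut = int(second_diff.abs().argmax()) + 1       # sharpest curvature
    return order[:cut + 1].sort().values            # tokens before the bend
```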
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.