A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
- URL: http://arxiv.org/abs/2508.01548v1
- Date: Sun, 03 Aug 2025 02:15:43 GMT
- Title: A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
- Authors: Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
- Abstract summary: We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
- Score: 94.49953824684853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
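The contrast the abstract draws between fixed compression ratios and dynamic pruning can be illustrated with a small sketch. This is not the GlimpsePrune implementation: the per-token relevance scores, the threshold rule, and the function names `fixed_ratio_prune` and `dynamic_prune` are all assumptions made for illustration. It only shows why a fixed ratio keeps the same number of tokens for every image, while a score-based rule can keep few tokens for a simple scene and many for a complex one.

```python
from typing import List


def fixed_ratio_prune(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the top `keep_ratio` fraction of tokens, whatever the image contains."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def dynamic_prune(scores: List[float], threshold: float) -> List[int]:
    """Keep every token whose score reaches `threshold`; the count varies per input."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    # Fall back to the single best token so at least one visual token survives.
    return kept or [max(range(len(scores)), key=lambda i: scores[i])]


# Hypothetical relevance scores for two images of 8 visual tokens each.
simple_scene = [0.9, 0.1, 0.05, 0.0, 0.12, 0.02, 0.03, 0.01]
complex_scene = [0.8, 0.7, 0.6, 0.1, 0.75, 0.65, 0.05, 0.9]

# A fixed 25% ratio keeps two tokens in both cases.
print(fixed_ratio_prune(simple_scene, 0.25))   # [0, 4]
print(fixed_ratio_prune(complex_scene, 0.25))  # [0, 7]

# A threshold keeps one token for the simple scene and six for the complex one.
print(dynamic_prune(simple_scene, 0.5))        # [0]
print(dynamic_prune(complex_scene, 0.5))       # [0, 1, 2, 4, 5, 7]
```

In this toy example the fixed ratio discards four informative tokens from the complex scene and keeps a low-score token from the simple one, which is the kind of imprecise pruning the abstract attributes to fixed-ratio methods.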
Related papers
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z)
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM. Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression [16.53645461974695]
Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution images. We propose an efficient visual token compression framework for text-oriented Vision Large Language Models (VLLMs) in high-resolution scenarios. Our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks.
arXiv Detail & Related papers (2025-02-22T16:05:33Z)
- PruneVid: Visual Token Pruning for Efficient Video Large Language Models [24.889834611542955]
We introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. LLMs have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens.
arXiv Detail & Related papers (2024-12-20T18:01:58Z)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models [53.199716363090154]
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens. Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
arXiv Detail & Related papers (2024-12-05T18:59:53Z)
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current work develops visual token compression methods to achieve efficiency improvements, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z)
- Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.