Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- URL: http://arxiv.org/abs/2412.13180v2
- Date: Thu, 31 Jul 2025 22:51:46 GMT
- Title: Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy
- Abstract summary: We study the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. We propose FEATHER, a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens.
- Score: 19.683461002518147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. Upon further investigation, we uncover a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, on many benchmarks aiming to evaluate vision-centric capabilities, strong performance persists with the flawed pruning strategy, highlighting these benchmarks' limited ability to assess fine-grained visual capabilities. Based on these findings, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens via multistage pruning with early uniform sampling to ensure broad image coverage. With comparable computational savings, we find that FEATHER achieves more than 5x performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
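To make the described mechanism concrete, the following PyTorch sketch shows how a single FEATHER-style pruning stage could combine an attention-based importance criterion with early uniform sampling to guarantee broad image coverage. The function names, the keep_ratio/uniform_frac parameters, and the choice of text-to-visual attention as the scoring signal are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def uniform_sample_indices(num_tokens: int, keep: int) -> torch.Tensor:
    """Evenly spaced visual-token indices, giving broad coverage of the image grid."""
    if keep <= 0:
        return torch.empty(0, dtype=torch.long)
    return torch.linspace(0, num_tokens - 1, steps=keep).round().long().unique()


def feather_style_prune(attn_to_visual: torch.Tensor,
                        keep_ratio: float = 0.25,
                        uniform_frac: float = 0.5) -> torch.Tensor:
    """Pick the visual-token indices to keep at one pruning stage.

    attn_to_visual: 1-D tensor of per-token importance scores, e.g. the attention
    mass that text tokens place on each visual token at the chosen LLM layer
    (an illustrative choice of criterion, not necessarily the paper's exact one).
    """
    n = attn_to_visual.numel()
    keep_total = max(1, int(n * keep_ratio))

    # Part of the budget is spent on uniformly sampled tokens so that the kept
    # set covers the whole image, including regions (e.g. the top of the image)
    # that attention-only criteria tend to discard.
    uniform_idx = uniform_sample_indices(n, int(keep_total * uniform_frac))

    # The rest of the budget goes to the highest-scoring remaining tokens.
    scores = attn_to_visual.clone()
    scores[uniform_idx] = float("-inf")  # avoid selecting the same token twice
    budget_left = max(keep_total - uniform_idx.numel(), 0)
    top_idx = scores.topk(budget_left).indices

    # In a multistage scheme this selection would be applied again at a later
    # layer, further shrinking the kept set.
    return torch.sort(torch.cat([uniform_idx, top_idx])).values


# Example: 576 visual tokens (a 24x24 patch grid), keeping roughly 25% at this stage.
kept = feather_style_prune(torch.rand(576))
```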
Related papers
- A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models [94.49953824684853]
We introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. An enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate.
arXiv Detail & Related papers (2025-08-03T02:15:43Z) - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z) - GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models [5.025353943896242]
GreedyPrune is a training-free visual token pruning algorithm designed to optimize semantic saliency and visual diversity. We show that GreedyPrune achieves state-of-the-art accuracy across various multimodal tasks and models while significantly reducing end-to-end inference latency.
arXiv Detail & Related papers (2025-06-16T07:21:11Z) - VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z) - STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference [3.9464481148889354]
We propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens.
arXiv Detail & Related papers (2025-05-18T10:44:45Z) - Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z) - Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding.
We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models.
Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z) - TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning method with inference Time Optimization for fast and low-memory VLMs.
Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z) - FCoT-VL:Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression [16.53645461974695]
Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution images.
We propose an efficient visual token compression framework for text-oriented Vision Large Language Models (VLLMs) in high-resolution scenarios.
Our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks.
arXiv Detail & Related papers (2025-02-22T16:05:33Z) - VisionZip: Longer is Better but Not Necessary in Vision Language Models [53.199716363090154]
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens. Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
arXiv Detail & Related papers (2024-12-05T18:59:53Z) - FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current works develop visual token compression methods to achieve efficiency improvements, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler for compressing redundant regions with low information density, and a text-guided sampler for selecting visual tokens that are strongly correlated with the user instructions.
arXiv Detail & Related papers (2024-11-21T15:37:52Z) - Autogenic Language Embedding for Coherent Point Tracking [19.127052469203612]
We introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object.
Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network.
Our approach significantly improves tracking trajectories in lengthy videos with substantial appearance variations.
arXiv Detail & Related papers (2024-07-30T11:02:45Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in the semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z) - Fine-Grained Visual Prompting [35.032567257651515]
Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions.
It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
arXiv Detail & Related papers (2023-06-07T11:39:56Z) - SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models [35.5601603013045]
We propose SmartTrim, an adaptive acceleration framework for Vision-Language Models (VLMs).
We integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer.
We devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart.
arXiv Detail & Related papers (2023-05-24T11:18:00Z) - Effective End-to-End Vision Language Pretraining with Semantic Visual Loss [58.642954383282216]
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors.
We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy.
Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference.
arXiv Detail & Related papers (2023-01-18T00:22:49Z)