STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
- URL: http://arxiv.org/abs/2505.12359v1
- Date: Sun, 18 May 2025 10:44:45 GMT
- Title: STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
- Authors: Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang
- Abstract summary: We propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens.
- Score: 3.9464481148889354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.
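As a concrete illustration of the two-stage pipeline described in the abstract, the sketch below scores visual tokens first by the self-attention they receive in an early vision layer, then by the cross-modal attention they receive from text tokens later in the model. This is a minimal sketch: the function name, layer choices, keep ratios, and scoring rule (mean attention received) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def star_two_stage(visual_tokens: torch.Tensor,
                   self_attn: torch.Tensor,
                   cross_attn: torch.Tensor,
                   early_keep: float = 0.5,
                   late_keep: float = 0.5) -> torch.Tensor:
    """Two-stage attention-guided token reduction (illustrative sketch).

    visual_tokens: (B, Nv, D)  visual token embeddings
    self_attn:     (B, Nv, Nv) head-averaged self-attention, early vision layer
    cross_attn:    (B, Nt, Nv) head-averaged text-to-vision attention, later layer
    """
    B, Nv, D = visual_tokens.shape
    # Stage 1: keep tokens that receive the most attention from other visual
    # tokens (mean over query rows), dropping redundant low-level features.
    self_scores = self_attn.mean(dim=1)                       # (B, Nv)
    k1 = max(1, int(Nv * early_keep))
    keep1 = self_scores.topk(k1, dim=1).indices               # (B, k1)
    visual_tokens = visual_tokens.gather(1, keep1.unsqueeze(-1).expand(-1, -1, D))
    # Stage 2: keep tokens the text tokens attend to most. In the real
    # pipeline this map would be computed after stage 1; here we slice a
    # precomputed full map down to the stage-1 survivors for brevity.
    cross_scores = cross_attn.mean(dim=1).gather(1, keep1)    # (B, k1)
    k2 = max(1, int(k1 * late_keep))
    keep2 = cross_scores.topk(k2, dim=1).indices              # (B, k2)
    return visual_tokens.gather(1, keep2.unsqueeze(-1).expand(-1, -1, D))
```

Ranking by attention received is a common proxy for token importance in training-free pruning; STAR's exact scoring may differ.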
Related papers
- VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization [49.5501769221435]
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. We propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8
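The recycling mechanism can be pictured as merging pruned tokens into a summary token rather than discarding them outright. A minimal sketch, assuming an importance-weighted merge (the abstract does not specify VFlowOpt's exact recycling rule):

```python
import torch

def prune_with_recycling(tokens: torch.Tensor, scores: torch.Tensor,
                         keep_ratio: float = 0.1) -> torch.Tensor:
    """One progressive pruning step: retain high-importance tokens and
    recycle the pruned ones into a single aggregated token so their
    information is not lost entirely. tokens: (B, N, D); scores: (B, N)."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    order = scores.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :k], order[:, k:]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    dropped = tokens.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, D))
    # Merge pruned tokens with weights proportional to their importance.
    w = scores.gather(1, drop_idx).softmax(dim=1).unsqueeze(-1)
    recycled = (w * dropped).sum(dim=1, keepdim=True)         # (B, 1, D)
    return torch.cat([kept, recycled], dim=1)                 # (B, k + 1, D)
```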
arXiv Detail & Related papers (2025-08-07T09:47:21Z)
- Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers [8.486148475471271]
Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. We examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. We introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space.
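FNA's linear-time approximation sits in the Nyström family; below is a generic Nyström-style attention sketch with segment-mean landmarks. The landmark construction follows the common Nyströmformer-style recipe, not necessarily FNA's, and the sequence length is assumed divisible by the landmark count for brevity.

```python
import torch

def nystrom_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      num_landmarks: int = 32) -> torch.Tensor:
    """Approximate softmax attention in O(N * m) with m << N landmarks.
    q, k, v: (B, N, D); assumes N is divisible by num_landmarks."""
    B, N, D = q.shape
    m = min(num_landmarks, N)
    # Landmarks: mean-pool contiguous segments of queries and keys.
    q_l = q.reshape(B, m, N // m, D).mean(dim=2)              # (B, m, D)
    k_l = k.reshape(B, m, N // m, D).mean(dim=2)              # (B, m, D)
    scale = D ** -0.5
    kernel1 = torch.softmax(q @ k_l.transpose(-2, -1) * scale, dim=-1)    # (B, N, m)
    kernel2 = torch.softmax(q_l @ k_l.transpose(-2, -1) * scale, dim=-1)  # (B, m, m)
    kernel3 = torch.softmax(q_l @ k.transpose(-2, -1) * scale, dim=-1)    # (B, m, N)
    # Nystrom reconstruction: kernel1 @ pinv(kernel2) @ kernel3 ~ softmax(QK^T).
    return kernel1 @ torch.linalg.pinv(kernel2) @ (kernel3 @ v)           # (B, N, D)
```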
arXiv Detail & Related papers (2025-07-21T19:29:03Z)
- ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
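The two criteria can be sketched as a greedy farthest-point pass for diversity followed by a cross-attention relevance cut. Everything below (the farthest-point selection, the relevance score, the ratios) is an illustrative reading of "Diversity" and "task RElevance", not ToDRE's published procedure.

```python
import torch
import torch.nn.functional as F

def select_diverse(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy farthest-point selection in cosine space: each step adds the
    token least similar to everything already selected. tokens: (B, N, D)."""
    B, N, D = tokens.shape
    feats = F.normalize(tokens, dim=-1)
    idx = torch.zeros(B, k, dtype=torch.long)                 # seed with token 0
    max_sim = torch.full((B, N), float("-inf"))
    for i in range(1, k):
        last = feats.gather(1, idx[:, i - 1].view(B, 1, 1).expand(-1, 1, D))
        sim = (feats @ last.transpose(-2, -1)).squeeze(-1)    # (B, N)
        max_sim = torch.maximum(max_sim, sim)                 # similarity to selected set
        idx[:, i] = max_sim.argmin(dim=1)                     # pick the farthest token
    return idx

def todre_prune(tokens, cross_attn, diverse_keep=0.4, final_keep=0.5):
    """Stage 1 keeps a diverse subset; stage 2 keeps the survivors most
    attended by text tokens. cross_attn: (B, Nt, Nv)."""
    B, N, D = tokens.shape
    idx = select_diverse(tokens, max(2, int(N * diverse_keep)))
    relevance = cross_attn.mean(dim=1).gather(1, idx)         # (B, k1)
    k2 = max(1, int(idx.size(1) * final_keep))
    final_idx = idx.gather(1, relevance.topk(k2, dim=1).indices)
    return tokens.gather(1, final_idx.unsqueeze(-1).expand(-1, -1, D))
```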
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
- Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z)
- Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features [24.33252753245426]
We exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our model can reduce inference latency and memory usage while achieving benchmark parity.
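A sketch of how cross-attention sparsity can drive trimming: drop the visual key/value entries that accumulate negligible attention from text tokens, shrinking the cross-attention KV cache. The cache layout and the accumulated-attention criterion are simplifications assumed here, not the paper's exact method.

```python
import torch

def trim_cross_attn_kv(k_cache: torch.Tensor, v_cache: torch.Tensor,
                       attn_map: torch.Tensor, keep_ratio: float = 0.2):
    """Trim visual entries from a cross-attention KV cache.
    k_cache, v_cache: (B, Nv, D) visual keys/values
    attn_map:         (B, Nt, Nv) text-to-vision attention (typically sparse)."""
    B, Nv, D = k_cache.shape
    usage = attn_map.sum(dim=1)                               # (B, Nv) accumulated attention
    k = max(1, int(Nv * keep_ratio))
    idx = usage.topk(k, dim=1).indices                        # keep the attended features
    idx = idx.unsqueeze(-1).expand(-1, -1, D)
    return k_cache.gather(1, idx), v_cache.gather(1, idx)
```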
arXiv Detail & Related papers (2025-04-01T09:10:32Z)
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM. Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration [19.683461002518147]
We study the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. We propose FEATHER, a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens.
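The abstract's fix for vision-centric tasks suggests blending a score-ranked criterion with uniform spatial coverage; the sketch below keeps a mix of evenly spaced and top-scored tokens. The blend ratio and the scoring input are assumptions for illustration; FEATHER's actual criterion and pruning schedule differ in detail.

```python
import torch

def coverage_aware_keep(scores: torch.Tensor, keep_ratio: float = 0.25,
                        uniform_frac: float = 0.5) -> list:
    """Keep a mix of evenly spaced tokens (spatial coverage for tasks like
    localization) and the highest-scoring tokens. scores: (B, N)."""
    B, N = scores.shape
    k = max(2, int(N * keep_ratio))
    n_uniform = max(1, int(k * uniform_frac))
    uniform_idx = torch.linspace(0, N - 1, n_uniform).long()  # evenly spaced grid tokens
    top_idx = scores.topk(k - n_uniform, dim=1).indices       # score-ranked tokens
    # Union per sample; lengths may differ slightly after deduplication.
    return [torch.cat([uniform_idx, top_idx[b]]).unique() for b in range(B)]
```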
arXiv Detail & Related papers (2024-12-17T18:56:50Z)
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. Our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM,
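Insight (ii) (aggregate attention across all layers) and the small-VLM guidance of insight (iii) can be sketched as below. The layer/head/query averaging and the assumption that both models share one visual token grid are simplifications of the paper's method.

```python
import torch

def global_token_scores(attn_maps: list) -> torch.Tensor:
    """Aggregate text-to-vision attention across *all* layers into a single
    global importance map. attn_maps: per-layer (B, heads, Nt, Nv) tensors."""
    stacked = torch.stack(attn_maps)                          # (L, B, heads, Nt, Nv)
    return stacked.mean(dim=(0, 2, 3))                        # (B, Nv)

def prune_with_guidance(large_visual_tokens: torch.Tensor,
                        small_vlm_attn_maps: list,
                        keep_ratio: float = 0.1) -> torch.Tensor:
    """Rank the large VLM's visual tokens with the small VLM's global map,
    assuming both models tokenize the image into the same grid."""
    scores = global_token_scores(small_vlm_attn_maps)
    k = max(1, int(large_visual_tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices
    idx = idx.unsqueeze(-1).expand(-1, -1, large_visual_tokens.size(-1))
    return large_visual_tokens.gather(1, idx)
```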
arXiv Detail & Related papers (2024-12-04T13:56:44Z)