STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
- URL: http://arxiv.org/abs/2505.12359v1
- Date: Sun, 18 May 2025 10:44:45 GMT
- Title: STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
- Authors: Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang
- Abstract summary: We propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens.
- Score: 3.9464481148889354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.
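As a concrete illustration of the two-stage pipeline described in the abstract, the sketch below scores visual tokens first by the self-attention they receive in an early vision layer, then by the cross-modal attention they receive from text tokens later in the model. This is a minimal sketch: the function name, layer choices, keep ratios, and scoring rule (mean attention received) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def star_two_stage(visual_tokens: torch.Tensor,
                   self_attn: torch.Tensor,
                   cross_attn: torch.Tensor,
                   early_keep: float = 0.5,
                   late_keep: float = 0.5) -> torch.Tensor:
    """Two-stage attention-guided token reduction (illustrative sketch).

    visual_tokens: (B, Nv, D)  visual token embeddings
    self_attn:     (B, Nv, Nv) head-averaged self-attention, early vision layer
    cross_attn:    (B, Nt, Nv) head-averaged text-to-vision attention, later layer
    """
    B, Nv, D = visual_tokens.shape
    # Stage 1: keep tokens that receive the most attention from other visual
    # tokens (mean over query rows), dropping redundant low-level features.
    self_scores = self_attn.mean(dim=1)                       # (B, Nv)
    k1 = max(1, int(Nv * early_keep))
    keep1 = self_scores.topk(k1, dim=1).indices               # (B, k1)
    visual_tokens = visual_tokens.gather(1, keep1.unsqueeze(-1).expand(-1, -1, D))
    # Stage 2: keep tokens the text tokens attend to most. In the real
    # pipeline this map would be computed after stage 1; here we slice a
    # precomputed full map down to the stage-1 survivors for brevity.
    cross_scores = cross_attn.mean(dim=1).gather(1, keep1)    # (B, k1)
    k2 = max(1, int(k1 * late_keep))
    keep2 = cross_scores.topk(k2, dim=1).indices              # (B, k2)
    return visual_tokens.gather(1, keep2.unsqueeze(-1).expand(-1, -1, D))
```

Ranking by attention received is a common proxy for token importance in training-free pruning; STAR's exact scoring may differ.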
Related papers
- VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization [49.5501769221435]
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. We propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8
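The recycling mechanism can be pictured as merging pruned tokens into a summary token rather than discarding them outright. A minimal sketch, assuming an importance-weighted merge (the abstract does not specify VFlowOpt's exact recycling rule):

```python
import torch

def prune_with_recycling(tokens: torch.Tensor, scores: torch.Tensor,
                         keep_ratio: float = 0.1) -> torch.Tensor:
    """One progressive pruning step: retain high-importance tokens and
    recycle the pruned ones into a single aggregated token so their
    information is not lost entirely. tokens: (B, N, D); scores: (B, N)."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    order = scores.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :k], order[:, k:]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    dropped = tokens.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, D))
    # Merge pruned tokens with weights proportional to their importance.
    w = scores.gather(1, drop_idx).softmax(dim=1).unsqueeze(-1)
    recycled = (w * dropped).sum(dim=1, keepdim=True)         # (B, 1, D)
    return torch.cat([kept, recycled], dim=1)                 # (B, k + 1, D)
```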
arXiv Detail & Related papers (2025-08-07T09:47:21Z)
- Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers [8.486148475471271]
Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. We examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. We introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space.
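FNA's linear-time approximation sits in the Nyström family; below is a generic Nyström-style attention sketch with segment-mean landmarks. The landmark construction follows the common Nyströmformer-style recipe, not necessarily FNA's, and the sequence length is assumed divisible by the landmark count for brevity.

```python
import torch

def nystrom_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      num_landmarks: int = 32) -> torch.Tensor:
    """Approximate softmax attention in O(N * m) with m << N landmarks.
    q, k, v: (B, N, D); assumes N is divisible by num_landmarks."""
    B, N, D = q.shape
    m = min(num_landmarks, N)
    # Landmarks: mean-pool contiguous segments of queries and keys.
    q_l = q.reshape(B, m, N // m, D).mean(dim=2)              # (B, m, D)
    k_l = k.reshape(B, m, N // m, D).mean(dim=2)              # (B, m, D)
    scale = D ** -0.5
    kernel1 = torch.softmax(q @ k_l.transpose(-2, -1) * scale, dim=-1)    # (B, N, m)
    kernel2 = torch.softmax(q_l @ k_l.transpose(-2, -1) * scale, dim=-1)  # (B, m, m)
    kernel3 = torch.softmax(q_l @ k.transpose(-2, -1) * scale, dim=-1)    # (B, m, N)
    # Nystrom reconstruction: kernel1 @ pinv(kernel2) @ kernel3 ~ softmax(QK^T).
    return kernel1 @ torch.linalg.pinv(kernel2) @ (kernel3 @ v)           # (B, N, D)
```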
arXiv Detail & Related papers (2025-07-21T19:29:03Z)
- ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
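The two criteria can be sketched as a greedy farthest-point pass for diversity followed by a cross-attention relevance cut. Everything below (the farthest-point selection, the relevance score, the ratios) is an illustrative reading of "Diversity" and "task RElevance", not ToDRE's published procedure.

```python
import torch
import torch.nn.functional as F

def select_diverse(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy farthest-point selection in cosine space: each step adds the
    token least similar to everything already selected. tokens: (B, N, D)."""
    B, N, D = tokens.shape
    feats = F.normalize(tokens, dim=-1)
    idx = torch.zeros(B, k, dtype=torch.long)                 # seed with token 0
    max_sim = torch.full((B, N), float("-inf"))
    for i in range(1, k):
        last = feats.gather(1, idx[:, i - 1].view(B, 1, 1).expand(-1, 1, D))
        sim = (feats @ last.transpose(-2, -1)).squeeze(-1)    # (B, N)
        max_sim = torch.maximum(max_sim, sim)                 # similarity to selected set
        idx[:, i] = max_sim.argmin(dim=1)                     # pick the farthest token
    return idx

def todre_prune(tokens, cross_attn, diverse_keep=0.4, final_keep=0.5):
    """Stage 1 keeps a diverse subset; stage 2 keeps the survivors most
    attended by text tokens. cross_attn: (B, Nt, Nv)."""
    B, N, D = tokens.shape
    idx = select_diverse(tokens, max(2, int(N * diverse_keep)))
    relevance = cross_attn.mean(dim=1).gather(1, idx)         # (B, k1)
    k2 = max(1, int(idx.size(1) * final_keep))
    final_idx = idx.gather(1, relevance.topk(k2, dim=1).indices)
    return tokens.gather(1, final_idx.unsqueeze(-1).expand(-1, -1, D))
```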
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
- Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z)
- Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features [24.33252753245426]
We exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our model can reduce inference latency and memory usage while achieving benchmark parity.
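A sketch of how cross-attention sparsity can drive trimming: drop the visual key/value entries that accumulate negligible attention from text tokens, shrinking the cross-attention KV cache. The cache layout and the accumulated-attention criterion are simplifications assumed here, not the paper's exact method.

```python
import torch

def trim_cross_attn_kv(k_cache: torch.Tensor, v_cache: torch.Tensor,
                       attn_map: torch.Tensor, keep_ratio: float = 0.2):
    """Trim visual entries from a cross-attention KV cache.
    k_cache, v_cache: (B, Nv, D) visual keys/values
    attn_map:         (B, Nt, Nv) text-to-vision attention (typically sparse)."""
    B, Nv, D = k_cache.shape
    usage = attn_map.sum(dim=1)                               # (B, Nv) accumulated attention
    k = max(1, int(Nv * keep_ratio))
    idx = usage.topk(k, dim=1).indices                        # keep the attended features
    idx = idx.unsqueeze(-1).expand(-1, -1, D)
    return k_cache.gather(1, idx), v_cache.gather(1, idx)
```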
arXiv Detail & Related papers (2025-04-01T09:10:32Z)
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [56.43860351559185]
We introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM. Our framework incorporates a visual-aware cost function to measure the importance of each source visual token, enabling effective pruning of low-importance tokens.
arXiv Detail & Related papers (2025-03-24T01:47:26Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration [19.683461002518147]
We study the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. We propose FEATHER, a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens.
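The abstract's fix for vision-centric tasks suggests blending a score-ranked criterion with uniform spatial coverage; the sketch below keeps a mix of evenly spaced and top-scored tokens. The blend ratio and the scoring input are assumptions for illustration; FEATHER's actual criterion and pruning schedule differ in detail.

```python
import torch

def coverage_aware_keep(scores: torch.Tensor, keep_ratio: float = 0.25,
                        uniform_frac: float = 0.5) -> list:
    """Keep a mix of evenly spaced tokens (spatial coverage for tasks like
    localization) and the highest-scoring tokens. scores: (B, N)."""
    B, N = scores.shape
    k = max(2, int(N * keep_ratio))
    n_uniform = max(1, int(k * uniform_frac))
    uniform_idx = torch.linspace(0, N - 1, n_uniform).long()  # evenly spaced grid tokens
    top_idx = scores.topk(k - n_uniform, dim=1).indices       # score-ranked tokens
    # Union per sample; lengths may differ slightly after deduplication.
    return [torch.cat([uniform_idx, top_idx[b]]).unique() for b in range(B)]
```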
arXiv Detail & Related papers (2024-12-17T18:56:50Z)
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. Our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM,
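Insight (ii) (aggregate attention across all layers) and the small-VLM guidance of insight (iii) can be sketched as below. The layer/head/query averaging and the assumption that both models share one visual token grid are simplifications of the paper's method.

```python
import torch

def global_token_scores(attn_maps: list) -> torch.Tensor:
    """Aggregate text-to-vision attention across *all* layers into a single
    global importance map. attn_maps: per-layer (B, heads, Nt, Nv) tensors."""
    stacked = torch.stack(attn_maps)                          # (L, B, heads, Nt, Nv)
    return stacked.mean(dim=(0, 2, 3))                        # (B, Nv)

def prune_with_guidance(large_visual_tokens: torch.Tensor,
                        small_vlm_attn_maps: list,
                        keep_ratio: float = 0.1) -> torch.Tensor:
    """Rank the large VLM's visual tokens with the small VLM's global map,
    assuming both models tokenize the image into the same grid."""
    scores = global_token_scores(small_vlm_attn_maps)
    k = max(1, int(large_visual_tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices
    idx = idx.unsqueeze(-1).expand(-1, -1, large_visual_tokens.size(-1))
    return large_visual_tokens.gather(1, idx)
```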
arXiv Detail & Related papers (2024-12-04T13:56:44Z)