PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
- URL: http://arxiv.org/abs/2602.04657v2
- Date: Thu, 05 Feb 2026 12:00:10 GMT
- Title: PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
- Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen,
- Abstract summary: We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) and thereby accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens while maintaining 97.2% of the original performance.
- Score: 59.24570811503256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed from inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: visual token compression is reformulated as preserving output-result invariance, and tokens are selected primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint linking the current layer to the final result. The most valuable vision tokens are then selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches such as VisionZip as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV-cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
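The selection step described in the abstract (saliency-guided reordering followed by NMS over the token grid) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the saliency scores are taken as given (in PIO-FVLM they come from the gradient of a layer-local proxy loss), and the grid layout and Chebyshev suppression radius are assumptions.

```python
import numpy as np

def nms_token_select(saliency, grid_w, keep, radius=1):
    """Select visual tokens by saliency with NMS-style spatial suppression.

    saliency: 1-D array, one importance score per visual token
              (row-major over a grid of width grid_w).
    keep:     number of tokens to retain.
    radius:   Chebyshev radius within which lower-scoring neighbours
              of a selected token are suppressed.
    """
    order = np.argsort(saliency)[::-1]           # highest saliency first
    suppressed = np.zeros(len(saliency), dtype=bool)
    selected = []
    for idx in order:
        if suppressed[idx]:
            continue
        selected.append(int(idx))
        if len(selected) == keep:
            break
        r, c = divmod(int(idx), grid_w)
        for j in range(len(saliency)):           # suppress spatial neighbours
            rj, cj = divmod(j, grid_w)
            if j != idx and max(abs(rj - r), abs(cj - c)) <= radius:
                suppressed[j] = True
    return sorted(selected)                      # restore original token order
```

Because adjacent visual tokens tend to carry overlapping information, suppressing a selected token's spatial neighbours spreads the kept budget across distinct image regions instead of clustering it around a single high-saliency patch.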
Related papers
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models [4.273730624882391]
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. We show that neither signal alone is sufficient: fusing them consistently improves performance over unimodal visual token selection (ranking). We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention.
arXiv Detail & Related papers (2026-02-01T00:28:55Z) - StreamingTOM: Streaming Token Compression for Efficient Video Understanding [6.9203477336374775]
Existing approaches only regulate the post-LLM KV cache, leaving the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Experiments demonstrate our method achieves $15.7\times$ KV-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster TTFT compared to the prior SOTA.
arXiv Detail & Related papers (2025-10-21T03:39:41Z) - VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks. Previous token compression techniques are often constrained by hand-crafted rules that risk discarding critical information. We reformulate token compression as an end-to-end learnable decision process within a lightweight plug-and-play framework.
arXiv Detail & Related papers (2025-10-18T17:54:18Z) - VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a $2.91\times$ speedup in prefilling and a $10\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z) - Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with $2.3\times$ fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z) - DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models [28.379533608574814]
We present DyCoke, a training-free token compression method that optimizes token representation and accelerates video large language models. DyCoke incorporates a plug-and-play temporal compression module that minimizes temporal redundancy by merging redundant tokens across frames. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step.
arXiv Detail & Related papers (2024-11-22T15:55:19Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Model (LLM) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss at a compression ratio of $576\times$, resulting in up to 94.8% fewer FLOPs and a 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
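The FastV criterion summarized above ranks visual tokens by the attention they receive at a deep layer and drops the rest. A minimal sketch of that ranking step, under the assumption that we are given one attention row (from the final token) and the index range of the visual tokens; the function name and arguments are illustrative, not FastV's actual API:

```python
import numpy as np

def attention_prune(attn_last_row, vis_start, vis_end, keep_ratio):
    """Keep the visual tokens that receive the most attention.

    attn_last_row: attention weights from one query token over the
                   full sequence (post-softmax, one layer/head average).
    vis_start/vis_end: half-open index range of the visual tokens.
    keep_ratio: fraction of visual tokens to retain.
    Returns the kept visual-token indices in original sequence order.
    """
    vis_attn = attn_last_row[vis_start:vis_end]
    k = max(1, int(round(keep_ratio * len(vis_attn))))
    top = np.argsort(vis_attn)[::-1][:k] + vis_start  # top-k by attention
    return np.sort(top)                               # preserve ordering
```

Sorting the surviving indices matters: the pruned sequence must keep its original positional order so downstream layers see a coherent (if shorter) sequence.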
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.