Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
- URL: http://arxiv.org/abs/2602.05809v2
- Date: Mon, 09 Feb 2026 12:35:58 GMT
- Title: Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
- Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu,
- Abstract summary: Vision-language models (VLMs) often generate massive numbers of visual tokens that greatly increase inference latency and memory footprint. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions. FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods.
- Score: 78.75062483648243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) often generate massive numbers of visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code can be found at https://github.com/ILOT-code/FSR.
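Read step by step, the abstract describes three selection operations over the visual token sequence: a relevance-weighted focus, a dissimilarity-driven scan, and a score-weighted merge. The PyTorch sketch below is only one way to read those three steps, not the authors' released implementation (which lives at the GitHub link above); the inputs `vis_importance` and `text_relevance`, the blend weight `alpha`, and the budgets `k_focus`/`k_scan` are assumed placeholders.

```python
import torch
import torch.nn.functional as F


def fsr_prune(visual_tokens, vis_importance, text_relevance,
              k_focus=64, k_scan=32, alpha=0.5):
    """Toy Focus-Scan-Refine over one image's visual tokens.

    visual_tokens  : (N, D) patch-token embeddings from the vision encoder
    vis_importance : (N,)   visual saliency scores (e.g. [CLS] attention)
    text_relevance : (N,)   instruction relevance (e.g. text-embedding similarity)
    Returns a (k_focus + k_scan, D) tensor of kept tokens.
    """
    N, _ = visual_tokens.shape
    assert N >= k_focus + k_scan, "budget must not exceed the token count"
    unit = F.normalize(visual_tokens, dim=-1)

    # Focus: blend visual saliency with instruction relevance and keep the top
    # tokens, so visually salient but query-irrelevant regions do not dominate.
    score = alpha * vis_importance + (1.0 - alpha) * text_relevance
    focus_idx = score.topk(k_focus).indices
    focused = visual_tokens[focus_idx]

    # Scan: among the remaining tokens, keep those least similar to any focused
    # token, i.e. complementary context conditioned on the focused set.
    rest_mask = torch.ones(N, dtype=torch.bool)
    rest_mask[focus_idx] = False
    rest_idx = rest_mask.nonzero(as_tuple=True)[0]
    novelty = 1.0 - (unit[rest_idx] @ unit[focus_idx].T).max(dim=-1).values
    scan_idx = rest_idx[novelty.topk(k_scan).indices]
    anchors = visual_tokens[scan_idx].clone()

    # Refine: assign every leftover token to its most similar scan anchor and
    # merge each group by score-weighted averaging; the budget stays fixed.
    left_mask = rest_mask.clone()
    left_mask[scan_idx] = False
    left_idx = left_mask.nonzero(as_tuple=True)[0]
    assign = (unit[left_idx] @ unit[scan_idx].T).argmax(dim=-1)
    for a in range(k_scan):
        members = left_idx[assign == a]
        if members.numel() == 0:
            continue
        group = torch.cat([visual_tokens[scan_idx[a]].unsqueeze(0),
                           visual_tokens[members]], dim=0)
        weights = torch.cat([score[scan_idx[a]].unsqueeze(0),
                             score[members]]).clamp(min=1e-6)
        anchors[a] = (weights.unsqueeze(-1) * group).sum(dim=0) / weights.sum()

    return torch.cat([focused, anchors], dim=0)
```

Called as `fsr_prune(tokens, cls_attn, text_sim)`, the sketch always returns `k_focus + k_scan` tokens, matching the fixed-budget behaviour the abstract describes; how the two score vectors are computed and where in the model the pruning is applied are details the abstract leaves open, so they are kept as inputs here.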
Related papers
- IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning [27.75049214892312]
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. We propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens.
arXiv Detail & Related papers (2026-02-03T03:39:31Z) - ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models [7.7352936204066]
We propose a novel, zero-shot method to model visual token pruning as a balance between task relevance and information diversity. Our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss. These gains are accompanied by significant reductions in GPU memory footprint and inference latency.
arXiv Detail & Related papers (2025-10-20T06:18:47Z) - Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive numbers of visual tokens. Recent studies have explored token pruning to alleviate this problem, typically relying on text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z) - VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization [70.98122339799218]
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. We propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8
arXiv Detail & Related papers (2025-08-07T09:47:21Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z) - Beyond Intermediate States: Explaining Visual Redundancy through Language [7.275188652473603]
Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens. Visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information. We develop a more reliable method for identifying and pruning redundant visual tokens.
arXiv Detail & Related papers (2025-03-26T13:38:10Z) - QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)