Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
- URL: http://arxiv.org/abs/2602.15318v1
- Date: Tue, 17 Feb 2026 02:51:36 GMT
- Title: Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
- Authors: Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li
- Abstract summary: Video Large Language Models (Vid-LLMs) typically fall into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences.
- Score: 28.766303423132722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
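To make the setting concrete, the draft-then-verify loop that the abstract builds on can be sketched as follows. This is a minimal greedy illustration of generic speculative decoding, not the Sparrow algorithm itself; `draft_model` and `target_model` are hypothetical stand-ins that map a token sequence to its next token.

```python
def speculative_decode(target_model, draft_model, prompt, num_new_tokens, k=4):
    """Generate tokens by letting the cheap draft model propose k tokens,
    then verifying them with the target model and keeping the longest
    agreeing prefix (plus one corrected token on the first mismatch)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap per step).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies the proposals (a single parallel pass in practice).
        accepted = 0
        for i, t in enumerate(proposal):
            if target_model(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        if accepted < k:
            # On mismatch, fall back to the target's own next token.
            tokens.append(target_model(tokens))
    return tokens[: len(prompt) + num_new_tokens]
```

Because acceptance only keeps tokens that match the target's greedy choice, the output is identical to decoding with the target alone; the speedup comes from verifying several tokens per target pass. Sparrow's contribution lies in what the draft model sees (reused text hidden states instead of raw visual tokens), which this sketch does not model.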
Related papers
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
''Chatting with images'' is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z) - ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens. We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the large language model.
arXiv Detail & Related papers (2026-01-25T12:47:30Z) - Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models [41.59364061354628]
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
arXiv Detail & Related papers (2026-01-12T07:48:26Z) - CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z) - VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning [69.64660280965971]
VideoAnchor is a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining. We show consistent performance gains on benchmarks with InternVL2-8B and Qwen2.5-VL-72B. Our codes will be made public at https://github.com/feufhd/VideoAnchor.
arXiv Detail & Related papers (2025-09-29T17:54:04Z) - HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models [23.98782884568504]
We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS). HiViS is an explicit-implicit input decomposition framework that alleviates the inefficiency of speculative decoding in Vision-Language Models. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM's input.
arXiv Detail & Related papers (2025-09-28T15:05:21Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
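The intra-modal selection criterion described above can be sketched in a few lines. This is a hypothetical illustration of attention-based token pruning in the spirit of VisionDrop, not the paper's implementation: each visual token is scored by the mean attention mass it receives from other visual tokens, and only the top fraction is kept.

```python
import numpy as np

def select_visual_tokens(attn, keep_ratio):
    """Given a visual self-attention matrix `attn` (queries x keys),
    score each token by the mean attention it receives and keep the
    top `keep_ratio` fraction, preserving original token order."""
    received = attn.mean(axis=0)                  # attention mass per key token
    k = max(1, int(attn.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(received)[-k:])     # top-k indices, in order
    return keep
```

Being training-free, a criterion like this can be applied to a frozen model at inference time; the pruned token indices would then be used to slice the visual sequence before it enters the LLM.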
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [47.40091830500585]
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. The proposed framework, video-level language-driven VVI-ReID (VLD), consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal aggregation.
arXiv Detail & Related papers (2025-06-03T04:49:08Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction, and spatio-temporal reasoning. We propose ReferDINO, an RVOS approach that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
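The "skipping layers" idea can be made concrete with a small routing sketch. This is an illustrative mixture-of-depths-style layer in the spirit of the description above, not VideoLLM-MoD's actual implementation; `scores` stands in for a learned router and `layer_fn` for the layer's attention/MLP computation.

```python
import numpy as np

def mod_skip_layer(hidden, scores, keep_ratio, layer_fn):
    """Route only the top-`keep_ratio` fraction of tokens (by router
    `scores`) through `layer_fn`; the remaining tokens skip the layer's
    computation entirely via the residual path."""
    n = hidden.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]           # indices of tokens to process
    out = hidden.copy()                      # skipped tokens pass through unchanged
    out[keep] = layer_fn(hidden[keep])       # processed tokens are updated
    return out
```

Because compute scales with the number of routed tokens rather than the sequence length, redundant vision tokens can be carried through most layers almost for free, which is the efficiency argument the blurb makes for long or streaming video.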
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.