Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
- URL: http://arxiv.org/abs/2602.15318v1
- Date: Tue, 17 Feb 2026 02:51:36 GMT
- Title: Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
- Authors: Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li
- Abstract summary: Video Large Language Models (Vid-LLMs) typically fall into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences.
- Score: 28.766303423132722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
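To make the setting concrete, the draft-then-verify loop that the abstract builds on can be sketched as follows. This is a minimal greedy illustration of generic speculative decoding, not the Sparrow algorithm itself; `draft_model` and `target_model` are hypothetical stand-ins that map a token sequence to its next token.

```python
def speculative_decode(target_model, draft_model, prompt, num_new_tokens, k=4):
    """Generate tokens by letting the cheap draft model propose k tokens,
    then verifying them with the target model and keeping the longest
    agreeing prefix (plus one corrected token on the first mismatch)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap per step).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies the proposals (a single parallel pass in practice).
        accepted = 0
        for i, t in enumerate(proposal):
            if target_model(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        if accepted < k:
            # On mismatch, fall back to the target's own next token.
            tokens.append(target_model(tokens))
    return tokens[: len(prompt) + num_new_tokens]
```

Because acceptance only keeps tokens that match the target's greedy choice, the output is identical to decoding with the target alone; the speedup comes from verifying several tokens per target pass. Sparrow's contribution lies in what the draft model sees (reused text hidden states instead of raw visual tokens), which this sketch does not model.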
Related papers
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
''Chatting with images'' is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z) - ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens. We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the large language model.
arXiv Detail & Related papers (2026-01-25T12:47:30Z) - Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models [41.59364061354628]
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
arXiv Detail & Related papers (2026-01-12T07:48:26Z) - CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z) - VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning [69.64660280965971]
VideoAnchor is a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining. We show consistent performance gains on benchmarks with InternVL2-8B and Qwen2.5-VL-72B. Our codes will be made public at https://github.com/feufhd/VideoAnchor.
arXiv Detail & Related papers (2025-09-29T17:54:04Z) - HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models [23.98782884568504]
We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS). HiViS is an explicit-implicit input decomposition framework that alleviates the inefficiency of speculative decoding in Vision-Language Models. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM's input.
arXiv Detail & Related papers (2025-09-28T15:05:21Z) - Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment [38.04426918886084]
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). We introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention.
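The intra-modal selection criterion described above can be sketched in a few lines. This is a hypothetical illustration of attention-based token pruning in the spirit of VisionDrop, not the paper's implementation: each visual token is scored by the mean attention mass it receives from other visual tokens, and only the top fraction is kept.

```python
import numpy as np

def select_visual_tokens(attn, keep_ratio):
    """Given a visual self-attention matrix `attn` (queries x keys),
    score each token by the mean attention it receives and keep the
    top `keep_ratio` fraction, preserving original token order."""
    received = attn.mean(axis=0)                  # attention mass per key token
    k = max(1, int(attn.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(received)[-k:])     # top-k indices, in order
    return keep
```

Being training-free, a criterion like this can be applied to a frozen model at inference time; the pruned token indices would then be used to slice the visual sequence before it enters the LLM.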
arXiv Detail & Related papers (2025-06-27T14:55:40Z) - Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [47.40091830500585]
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. The proposed framework, video-level language-driven VVI-ReID (VLD), consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal aggregation.
arXiv Detail & Related papers (2025-06-03T04:49:08Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction, and spatio-temporal reasoning. We propose ReferDINO, an RVOS approach that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
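The "skipping layers" idea can be made concrete with a small routing sketch. This is an illustrative mixture-of-depths-style layer in the spirit of the description above, not VideoLLM-MoD's actual implementation; `scores` stands in for a learned router and `layer_fn` for the layer's attention/MLP computation.

```python
import numpy as np

def mod_skip_layer(hidden, scores, keep_ratio, layer_fn):
    """Route only the top-`keep_ratio` fraction of tokens (by router
    `scores`) through `layer_fn`; the remaining tokens skip the layer's
    computation entirely via the residual path."""
    n = hidden.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]           # indices of tokens to process
    out = hidden.copy()                      # skipped tokens pass through unchanged
    out[keep] = layer_fn(hidden[keep])       # processed tokens are updated
    return out
```

Because compute scales with the number of routed tokens rather than the sequence length, redundant vision tokens can be carried through most layers almost for free, which is the efficiency argument the blurb makes for long or streaming video.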
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.