SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
- URL: http://arxiv.org/abs/2508.16201v2
- Date: Thu, 28 Aug 2025 06:44:28 GMT
- Title: SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
- Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
- Abstract summary: We introduce SpecVLM, a training-free speculative decoding framework tailored for Vid-LLMs. SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. It achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
- Score: 27.000912841279597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
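The two-stage pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the verifier's attention has already been aggregated into one score per video token, treats the tokens as a flat 1D sequence so that "spatially uniform" becomes an even stride over the remaining positions, and uses a hypothetical function name `two_stage_prune` with hypothetical parameters `keep_ratio` and `stage1_share`.

```python
def two_stage_prune(attn, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of the video tokens kept for the draft model.

    attn:         one verifier attention score per video token (assumed input)
    keep_ratio:   fraction of tokens to keep overall (0.1 means 90% pruned)
    stage1_share: fraction of the kept budget filled by Stage I (assumption)
    """
    n = len(attn)
    if n == 0:
        return []
    budget = max(1, int(n * keep_ratio))
    k1 = min(budget, int(budget * stage1_share))

    # Stage I: keep the tokens the verifier attends to most.
    by_score = sorted(range(n), key=lambda i: attn[i], reverse=True)
    stage1 = set(by_score[:k1])

    # Stage II: fill the rest of the budget at an even stride over
    # the positions that Stage I did not select.
    rest = [i for i in range(n) if i not in stage1]
    k2 = min(budget - k1, len(rest))
    stage2 = [rest[(j * len(rest)) // k2] for j in range(k2)] if k2 else []

    return sorted(stage1.union(stage2))


# Toy input: 20 tokens, two of which receive high verifier attention.
scores = [0.0] * 20
scores[3], scores[17] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.2, stage1_share=0.5)
print(kept)
```

In this toy run the two high-attention tokens survive Stage I and the remaining budget is spread evenly over the other positions, so the draft model sees a short sequence that still covers the whole clip. The verifier keeps the full token set, which is why the abstract can describe the speedup as lossless.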
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning.
Date: 2026-02-20T18:59:50Z
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. Current methods use sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. We propose to leverage video primitives which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
Date: 2026-02-13T18:57:31Z
- FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging [27.981298261747288]
FlashVID is a training-free acceleration framework for Video Large Language Models (VLLMs). It selects the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) to reduce fine-grained spatiotemporal redundancy. FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL.
Date: 2026-02-08T15:56:46Z
- Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval [57.88666884515147]
We propose One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). OneClip-RAG makes full use of the merits of video clips for augmented video understanding. It is also equipped with a novel query-guided video chunking algorithm.
Date: 2025-12-09T09:40:20Z
- video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
video-SALMONN S is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. A test-time-training memory module continually updates token representations to capture long-range dependencies. A prompt-dependent memory reader retrieves context-relevant content from fixed-size memory.
Date: 2025-10-13T08:20:15Z
- METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding [55.38256656122857]
We propose METok, a training-free, Multi-stage Event-based Token compression framework. We show METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings.
Date: 2025-06-03T13:19:41Z
- Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs [25.13186579764434]
We introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ wall-time speedup in video processing.
Date: 2025-05-25T14:09:28Z
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input. Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
Date: 2025-04-09T12:51:10Z
- FastVID: Dynamic Density Pruning for Fast Video Large Language Models [38.267065642416554]
We propose Density Pruning for Fast Video LLMs, termed FastVID. FastVID partitions videos into temporally ordered segments to preserve temporal structure. Our method significantly reduces computational overhead while maintaining temporal and visual integrity.
Date: 2025-03-14T08:33:08Z
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
Date: 2025-03-06T06:17:38Z
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
Date: 2024-08-15T11:36:18Z