Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
- URL: http://arxiv.org/abs/2505.19155v1
- Date: Sun, 25 May 2025 14:09:28 GMT
- Title: Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
- Authors: Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu,
- Abstract summary: We introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing.
- Score: 25.13186579764434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
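The draft-then-verify loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the one-layer toy "model", the random parameters `KEY`, `VAL`, `OUT`, the function names, and the defaults `gamma=4` and `top_k=4` are all hypothetical. It assumes greedy decoding and that the sparse and dense models share weights and differ only in whether attention is restricted to the top-K positions. A real system would verify all drafted tokens in a single parallel dense forward pass; the loop here checks them one by one for clarity.

```python
import math
import random

VOCAB, DIM = 16, 8
_rng = random.Random(0)

def _table():
    # Hypothetical toy parameters standing in for a real Video-LLM.
    return [[_rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

KEY, VAL, OUT = _table(), _table(), _table()

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def next_token(tokens, top_k=None):
    """Greedy next token from one toy attention layer.

    top_k=None uses dense attention over every position; otherwise only
    the top_k highest-scoring positions take part in the softmax.
    """
    query = KEY[tokens[-1]]
    scores = [dot(query, KEY[t]) / math.sqrt(DIM) for t in tokens]
    keep = list(range(len(tokens)))
    if top_k is not None and top_k < len(tokens):
        keep = sorted(keep, key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(weights.values())
    context = [
        sum(w * VAL[tokens[i]][d] for i, w in weights.items()) / total
        for d in range(DIM)
    ]
    logits = [dot(context, OUT[v]) for v in range(VOCAB)]
    return max(range(VOCAB), key=lambda v: logits[v])

def dense_decode(prompt, n_new):
    """Baseline: one dense step per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def std_decode(prompt, n_new, gamma=4, top_k=4):
    """Sparse model drafts gamma tokens, dense model verifies them.

    Returns the decoded tokens and the number of verification rounds.
    """
    tokens = list(prompt)
    target = len(prompt) + n_new
    rounds = 0
    while len(tokens) < target:
        rounds += 1
        # 1) Draft gamma tokens with sparse top-K attention.
        draft = list(tokens)
        for _ in range(gamma):
            draft.append(next_token(draft, top_k=top_k))
        # 2) Verify each drafted token against the dense model.
        for i in range(len(tokens), len(draft)):
            dense_tok = next_token(draft[:i])
            if dense_tok != draft[i]:
                # First mismatch: keep the accepted prefix plus the dense token.
                tokens = draft[:i] + [dense_tok]
                break
        else:
            # Every draft was accepted: the dense pass yields one extra token.
            tokens = draft + [next_token(draft)]
    return tokens[:target], rounds

# Stand-in for a long sequence of video tokens followed by a question.
prompt = [random.Random(1).randrange(VOCAB) for _ in range(32)]
output, rounds = std_decode(prompt, n_new=12)
print("identical to dense decoding:", output == dense_decode(prompt, 12))
print("verification rounds:", rounds)
```

Because a drafted token is kept only when the dense model would have produced the same token from the same prefix, the output matches dense greedy decoding exactly. Any speedup depends on how often the sparse drafts are accepted and on verification being cheaper per token than sequential dense decoding.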
Related papers
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding [12.236081012244533]
This study focuses on video understanding by multimodal large language models (MLLMs).
We propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations.
We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks.
arXiv Detail & Related papers (2026-02-18T12:37:35Z) - VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation.
We propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel.
Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z) - FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function.
Our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens.
Our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution.
arXiv Detail & Related papers (2025-10-31T17:29:39Z) - video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
Video-SALMONN S is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget.
A test-time-training memory module continually updates token representations to capture long-range dependencies.
A prompt-dependent memory reader retrieves context-relevant content from fixed-size memory.
arXiv Detail & Related papers (2025-10-13T08:20:15Z) - SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning [27.000912841279597]
We introduce SpecVLM, a training-free speculative decoding framework tailored for Vid-LLMs.
SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy.
It achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
arXiv Detail & Related papers (2025-08-22T08:23:09Z) - Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference [88.57742986765238]
Free-MoRef is a training-free approach to multiplex the context perception capabilities of Video-MLLMs.
Experiments show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input frames without compression on a single A100 GPU.
arXiv Detail & Related papers (2025-08-04T07:31:10Z) - QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [54.38970077613728]
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting.
We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
arXiv Detail & Related papers (2025-05-22T03:26:50Z) - Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC).
We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders.
To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
arXiv Detail & Related papers (2025-04-14T17:34:06Z) - Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks [21.710127132217526]
We introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks.
VINs encode global semantics from the noisy input of local chunks, and the encoded representations, in turn, guide DiTs in denoising chunks in parallel.
Our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation.
arXiv Detail & Related papers (2025-03-21T21:13:02Z) - BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information.
We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner [53.671484175063995]
Video-LLMs are pre-trained to process short videos, limiting their broader application for understanding longer video content.
We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector.
arXiv Detail & Related papers (2024-09-19T17:59:55Z) - SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.