Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
- URL: http://arxiv.org/abs/2505.19155v1
- Date: Sun, 25 May 2025 14:09:28 GMT
- Title: Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
- Authors: Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu
- Abstract summary: We introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing.
- Score: 25.13186579764434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
Related papers
- Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference [88.57742986765238]
Free-MoRef is a training-free approach to multiplex the context perception capabilities of Video-MLLMs. Experiments show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input frames without compression on a single A100 GPU.
arXiv Detail & Related papers (2025-08-04T07:31:10Z)
- QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [54.38970077613728]
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
arXiv Detail & Related papers (2025-05-22T03:26:50Z)
- Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders (a minimal sketch of this segmentation step follows this entry). To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
arXiv Detail & Related papers (2025-04-14T17:34:06Z)
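As a small illustration of the segmentation step referenced above, the sketch below cuts a video into scenes wherever the cosine similarity between consecutive frame embeddings drops below a threshold. The per-frame encoder, the threshold value, and the function name are assumptions made for the example, not details from the TDC paper.

```python
import numpy as np

def segment_scenes(frame_embs: np.ndarray, sim_threshold: float = 0.85):
    """frame_embs: (num_frames, dim), L2-normalized per row."""
    sims = (frame_embs[:-1] * frame_embs[1:]).sum(axis=1)  # cosine of adjacent frames
    cuts = np.where(sims < sim_threshold)[0] + 1            # a new scene starts after a drop
    bounds = [0, *cuts.tolist(), len(frame_embs)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Example: three similar frames of one scene, then an abrupt change.
embs = np.array([[1, 0], [0.99, 0.14], [1, 0], [0, 1.0]], dtype=np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(segment_scenes(embs))  # [(0, 3), (3, 4)]
```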
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks [21.710127132217526]
We introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. VINs encode global semantics from the noisy input of local chunks, and the encoded representations in turn guide DiTs in denoising chunks in parallel (a toy sketch follows this entry). Our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation.
arXiv Detail & Related papers (2025-03-21T21:13:02Z)
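The chunk-parallel denoising idea can be caricatured in a few lines. In the toy sketch below, an abstraction module cross-attends from learned queries to the noisy latents of all chunks to produce shared global tokens, and each chunk is then denoised independently given those tokens. Every name and shape here (`AbstractionModule`, `denoise_chunk`, the dimensions) is an invented stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AbstractionModule(nn.Module):
    def __init__(self, dim=256, n_global=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_global, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, noisy_chunks: torch.Tensor) -> torch.Tensor:
        # noisy_chunks: (n_chunks, tokens_per_chunk, dim) -> global tokens (n_global, dim)
        flat = noisy_chunks.flatten(0, 1).unsqueeze(0)  # (1, all_tokens, dim)
        out, _ = self.attn(self.query.unsqueeze(0), flat, flat)
        return out.squeeze(0)

def denoise_step(chunks, denoise_chunk, abstraction):
    g = abstraction(chunks)  # shared global context from all noisy chunks
    # Each chunk is denoised independently (parallelizable across devices),
    # with the global tokens keeping semantics consistent across chunks.
    return torch.stack([denoise_chunk(c, g) for c in chunks])

chunks = torch.randn(8, 64, 256)  # 8 chunks of 64 latent tokens each
print(denoise_step(chunks, lambda c, g: 0.9 * c, AbstractionModule()).shape)
```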
- BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information. We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM (its placement is sketched after this entry). We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
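For a rough picture of where such a temporal encoder sits, the sketch below pools per-frame patch features, mixes them across time with a small transformer, and projects them to the LLM's embedding width. The module name, the mean-pooling, the use of nn.TransformerEncoder, and all dimensions are illustrative assumptions rather than STORM's actual design.

```python
import torch
import torch.nn as nn

class TemporalBridge(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(vis_dim, llm_dim)  # align to the LLM's embedding size

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, patches, vis_dim) from a frozen image encoder
        x = frame_feats.mean(dim=2)   # pool patches -> (batch, frames, vis_dim)
        x = self.temporal(x)          # mix information across time
        return self.proj(x)           # (batch, frames, llm_dim) tokens fed to the LLM

tokens = TemporalBridge()(torch.randn(1, 16, 196, 768))
print(tokens.shape)  # torch.Size([1, 16, 4096])
```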
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. It leverages DINOv2 features to remove redundant frames that exhibit high similarity, and performs spatial token reduction across frames based on their temporal dependencies (a sketch of the frame-pruning step follows this entry).
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
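The redundant-frame removal admits a simple greedy illustration: keep a frame only if its feature differs enough from the last frame kept. LongVU performs this comparison on DINOv2 features; the greedy rule, the threshold, and the function name below are assumptions made for the example.

```python
import torch

def prune_redundant_frames(feats: torch.Tensor, threshold: float = 0.9):
    """feats: (num_frames, dim). Keep a frame only if it is sufficiently
    different from the last frame that was kept."""
    feats = torch.nn.functional.normalize(feats, dim=-1)
    keep = [0]                                       # always keep the first frame
    for i in range(1, feats.size(0)):
        if torch.dot(feats[i], feats[keep[-1]]) < threshold:
            keep.append(i)                           # novel content: keep it
    return keep

# Example: frames 0-2 are near-duplicates, frame 3 differs.
f = torch.tensor([[1., 0.], [0.999, 0.04], [1., 0.], [0., 1.]])
print(prune_redundant_frames(f))  # [0, 3]
```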
- Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner [53.671484175063995]
Video-LLMs are pre-trained to process short videos, limiting their broader application for understanding longer video content. We introduce an alternative video token rearrangement technique that circumvents the limitations imposed by the fixed video encoder and alignment projector.
arXiv Detail & Related papers (2024-09-19T17:59:55Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose SlowFast-LLaVA (SF-LLaVA), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context. This is realized with a two-stream SlowFast design of inputs that aggregates features from sampled frames in an effective way (a sketch of the idea follows this entry). Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z)
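To illustrate the two-stream input, the sketch below assembles one token sequence from a slow pathway (few frames, full spatial detail) and a fast pathway (many frames, aggressively pooled tokens). The frame counts, pooling size, and helper name are illustrative assumptions, not SF-LLaVA's exact configuration.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor, slow_n=8, fast_n=64, fast_pool=4):
    """frame_feats: (frames, H, W, dim) patch features from an image encoder."""
    t, h, w, d = frame_feats.shape
    slow_idx = torch.linspace(0, t - 1, slow_n).long()   # sparse, detailed frames
    fast_idx = torch.linspace(0, t - 1, fast_n).long()   # dense, coarse frames
    slow = frame_feats[slow_idx].reshape(slow_n, h * w, d)
    fast = frame_feats[fast_idx].permute(0, 3, 1, 2)     # (fast_n, d, H, W)
    fast = F.adaptive_avg_pool2d(fast, fast_pool)        # shrink spatial tokens
    fast = fast.permute(0, 2, 3, 1).reshape(fast_n, fast_pool * fast_pool, d)
    # Concatenate both streams into one token sequence for the LLM.
    return torch.cat([slow.flatten(0, 1), fast.flatten(0, 1)], dim=0)

toks = slowfast_tokens(torch.randn(128, 14, 14, 768))
print(toks.shape)  # (8*196 + 64*16, 768) = torch.Size([2592, 768])
```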
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding. It encodes video representations that incorporate both local and global information, enabling more precise responses to long-video questions.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.