QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
- URL: http://arxiv.org/abs/2505.16175v2
- Date: Sat, 31 May 2025 13:43:36 GMT
- Title: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
- Authors: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen
- Abstract summary: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
- Score: 54.38970077613728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, where converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and a scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
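The keyframe-aligned splitting described in the abstract can be illustrated with a minimal sketch. This is not the authors' QuickDecoder: the function names `keyframe_aligned_intervals` and `decode_parallel`, the balancing heuristic, and the stub decoder are all assumptions made for illustration. The idea shown is only that each interval starts at a keyframe, so a worker can decode it without frames from any other interval.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def keyframe_aligned_intervals(
    keyframes: Sequence[int], total_frames: int, num_workers: int
) -> List[Tuple[int, int]]:
    """Split [0, total_frames) into at most num_workers half-open intervals.

    Every interval starts at a keyframe, so it can be decoded independently.
    (Hypothetical helper; the balancing rule is an assumption.)
    """
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected sorted keyframe indices starting at frame 0")
    target = total_frames / num_workers  # ideal interval length
    starts = [0]
    for k in keyframes[1:]:
        # Open a new interval at the first keyframe at or past the next ideal boundary.
        if len(starts) < num_workers and k >= len(starts) * target:
            starts.append(k)
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_parallel(
    keyframes: Sequence[int],
    total_frames: int,
    num_workers: int,
    decode_interval: Callable[[int, int], list],
) -> list:
    """Decode all intervals concurrently and return frames in display order."""
    intervals = keyframe_aligned_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in submission order, so concatenation keeps frame order.
        chunks = pool.map(lambda iv: decode_interval(*iv), intervals)
        return [frame for chunk in chunks for frame in chunk]


def fake_decode(start: int, end: int) -> list:
    """Stub decoder (assumption): returns frame indices instead of RGB frames."""
    return list(range(start, end))


intervals = keyframe_aligned_intervals([0, 30, 60, 90, 120, 150], 180, 3)
frames = decode_parallel([0, 30, 60, 90, 120, 150], 180, 3, fake_decode)
```

In a real pipeline the stub would be replaced by a codec call that seeks to the interval's first keyframe. Whether threads or processes give the better speedup depends on the decoder bindings used, which this sketch does not address.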
Related papers
- Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification [9.615466029246694]
Video-XL-2 is a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. It is capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
arXiv Detail & Related papers (2025-06-24T01:19:56Z)
- Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs [25.13186579764434]
We introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing.
arXiv Detail & Related papers (2025-05-25T14:09:28Z)
- VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [10.990431921021585]
We propose ReKV, a training-free approach that enables efficient streaming video question-answering (StreamingVQA). Our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received.
arXiv Detail & Related papers (2025-03-01T15:53:33Z)
- Fast Encoding and Decoding for Implicit Video Representation [88.43612845776265]
We introduce NeRV-Enc, a transformer-based hyper-network for fast encoding; and NeRV-Dec, a parallel decoder for efficient video loading.
NeRV-Enc achieves an impressive speed-up of $\mathbf{10^4\times}$ by eliminating gradient-based optimization.
NeRV-Dec simplifies video decoding, outperforming conventional codecs with a loading speed $\mathbf{11\times}$ faster.
arXiv Detail & Related papers (2024-09-28T18:21:52Z)
- Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.
arXiv Detail & Related papers (2024-06-24T01:56:12Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video tokens that are streamingly encoded and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.