LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
- URL: http://arxiv.org/abs/2505.15269v1
- Date: Wed, 21 May 2025 08:47:15 GMT
- Title: LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
- Authors: Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, Jieru Zhao
- Abstract summary: LiveVLM is a training-free framework specifically designed for streaming, online video understanding and real-time interaction. LiveVLM constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details and eliminate redundant KVs. When a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information.
- Score: 13.891391928767195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting the memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose $\textbf{LiveVLM}$, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after one question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real time, retain long-term video details and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44$\times$ the number of frames on the same device, and achieves up to a 5$\times$ speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
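The abstract outlines the core mechanism: video KVs are generated and compressed as the stream arrives, and question answering retrieves both recent (short-term) and archived (long-term) visual context. Below is a minimal, hypothetical Python sketch of that idea; the class name, the mean-pooling compression rule, and the similarity-based retrieval are illustrative assumptions, not LiveVLM's actual implementation.

```python
# Toy sketch of a streaming-oriented KV cache with ingest-time compression
# and query-time retrieval. Everything here (StreamingKVCache, the pooling
# compression, the dot-product scoring) is a hypothetical stand-in.
import numpy as np
from collections import deque

class StreamingKVCache:
    def __init__(self, block_frames=8, pool=4, short_term_blocks=4):
        self.block_frames = block_frames   # frames grouped into one KV block
        self.pool = pool                   # token pooling ratio for old blocks
        self.short_term = deque(maxlen=short_term_blocks)  # recent, uncompressed
        self.long_term = []                # older, compressed blocks
        self.buffer = []

    def add_frame(self, frame_kv):
        """frame_kv: (tokens, dim) key/value tensor for one encoded frame."""
        self.buffer.append(frame_kv)
        if len(self.buffer) == self.block_frames:
            block = np.stack(self.buffer)  # (frames, tokens, dim)
            self.buffer = []
            if len(self.short_term) == self.short_term.maxlen:
                # Oldest short-term block is demoted: compress before archiving.
                self.long_term.append(self._compress(self.short_term[0]))
            self.short_term.append(block)

    def _compress(self, block):
        # Hypothetical compression: average-pool groups of `pool` tokens,
        # shrinking memory while keeping a coarse summary of the visuals.
        f, t, d = block.shape
        t = t - t % self.pool
        return block[:, :t].reshape(f, t // self.pool, self.pool, d).mean(axis=2)

    def retrieve(self, query_vec, top_k=2):
        """Fetch all short-term blocks plus the top-k long-term blocks whose
        mean key is most similar to the query embedding."""
        scores = [float(query_vec @ b.mean(axis=(0, 1))) for b in self.long_term]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        return list(self.short_term) + [self.long_term[i] for i in top]

# Ingest a toy stream of 64 frames, then fetch context for one query.
cache = StreamingKVCache()
for _ in range(64):
    cache.add_frame(np.random.randn(196, 64))   # e.g. 196 patch tokens per frame
context_blocks = cache.retrieve(np.random.randn(64))
```

The design point this illustrates is that compression happens once per demoted block at ingest time, so answering a query costs only a similarity scan plus attention over the fetched blocks, rather than re-encoding the whole stream.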
Related papers
- Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time. Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z) - QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [54.38970077613728]
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
arXiv Detail & Related papers (2025-05-22T03:26:50Z) - VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame; a toy sketch of this carrier-token idea appears after this list.
arXiv Detail & Related papers (2025-03-12T13:30:40Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [10.990431921021585]
We propose ReKV, a training-free approach that enables efficient streaming video question-answering (StreamingVQA). Our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received.
arXiv Detail & Related papers (2025-03-01T15:53:33Z) - Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanism of humans.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
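As referenced in the VideoScan entry above, representing each frame with a single semantic carrier token is one way these systems keep token counts manageable. The sketch below is a hypothetical illustration using simple mean pooling; the function name and pooling rule are assumptions, not the paper's actual method.

```python
# Hypothetical sketch of the "one semantic carrier token per frame" idea from
# the VideoScan entry; the pooling rule and names are illustrative assumptions.
import numpy as np

def frame_to_carrier_token(patch_tokens: np.ndarray) -> np.ndarray:
    """Collapse a frame's (num_patches, dim) vision tokens into a single
    (dim,) carrier token, here via simple mean pooling."""
    return patch_tokens.mean(axis=0)

# A 256-frame clip that would normally cost 256 * num_patches tokens
# now costs only 256 tokens in the LLM context.
frames = [np.random.randn(196, 768) for _ in range(256)]  # e.g. ViT patch tokens
carriers = np.stack([frame_to_carrier_token(f) for f in frames])  # (256, 768)
```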