StreamingVLM: Real-Time Understanding for Infinite Video Streams
- URL: http://arxiv.org/abs/2510.09608v1
- Date: Fri, 10 Oct 2025 17:59:58 GMT
- Title: StreamingVLM: Real-Time Understanding for Infinite Video Streams
- Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
- Abstract summary: StreamingVLM is a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
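The compact KV-cache policy described in the abstract (permanent attention-sink tokens, a short window of recent vision tokens, and a longer window of recent text tokens) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, window sizes, and `(kind, payload)` token stand-ins are all assumptions.

```python
from collections import deque

class StreamingKVCache:
    """Sketch of a sink + dual-window KV-cache eviction policy.

    The first few tokens of the stream are kept forever as attention
    sinks; after that, vision tokens live in a short sliding window and
    text tokens in a longer one. Entries are (kind, payload) tuples
    standing in for real key/value tensors; sizes are illustrative.
    """

    def __init__(self, num_sinks=4, vision_window=8, text_window=16):
        self.num_sinks = num_sinks
        self.sinks = []                            # kept for the whole stream
        self.vision = deque(maxlen=vision_window)  # recent vision tokens
        self.text = deque(maxlen=text_window)      # recent text tokens

    def append(self, kind, payload):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((kind, payload))     # earliest tokens become sinks
        elif kind == "vision":
            self.vision.append((kind, payload))    # deque evicts the oldest
        else:
            self.text.append((kind, payload))

    def context(self):
        # Tokens the model would attend to at the next decoding step.
        return self.sinks + list(self.vision) + list(self.text)

cache = StreamingKVCache()
for t in range(100):                # arbitrarily long stream...
    cache.append("vision", t)       # e.g. one token per frame
    cache.append("text", t)         # e.g. one caption token per frame
print(len(cache.context()))         # ...but bounded context: 4 + 8 + 16 = 28
```

The point of the sketch is that the attended context stays constant-size no matter how long the stream runs, which is what makes per-step latency and memory flat instead of growing with video length.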
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular spatio-temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z) - Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams [11.495597616926274]
Event-VStream represents continuous video as a sequence of discrete, semantically coherent events. The system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and maintains around a 70% win rate against GPT-5 on 2-hour Ego4D streams.
arXiv Detail & Related papers (2026-01-22T05:05:53Z) - HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time, accurate understanding of video streams. It reuses a compact KV cache, enabling efficient streaming understanding under resource constraints, and achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance but does not perform inference on the fly. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z) - video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
video-SALMONN S is the first model to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. A test-time-training memory module continually updates token representations to capture long-range dependencies, and a prompt-dependent memory reader retrieves context-relevant content from the fixed-size memory.
arXiv Detail & Related papers (2025-10-13T08:20:15Z) - StreamForest: Efficient Online Video Understanding with Persistent Event Memory [37.73273040737155]
StreamForest is designed for streaming video understanding. A Fine-grained Spatiotemporal Window captures detailed short-term visual cues to improve current scene perception, and OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction.
arXiv Detail & Related papers (2025-09-29T14:53:57Z) - LongLive: Real-time Interactive Long Video Generation [68.45945318075432]
LongLive is a frame-level autoregressive framework for real-time, interactive long video generation. It sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos.
arXiv Detail & Related papers (2025-09-26T17:48:24Z) - Clapper: Compact Learning and Video Representation in VLMs [15.564506713994406]
Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. We propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding.
arXiv Detail & Related papers (2025-05-21T13:52:17Z) - LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [13.891391928767195]
LiveVLM is a training-free framework designed for streaming, online video understanding and real-time interaction. It constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs. When a new question is posed, LiveVLM runs an online question-answering process that efficiently fetches both short-term and long-term visual information.
arXiv Detail & Related papers (2025-05-21T08:47:15Z) - FastVID: Dynamic Density Pruning for Fast Video Large Language Models [38.267065642416554]
We propose Density Pruning for Fast Video LLMs, termed FastVID. FastVID partitions videos into temporally ordered segments to preserve temporal structure, significantly reducing computational overhead while maintaining temporal and visual integrity.
arXiv Detail & Related papers (2025-03-14T08:33:08Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long-video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Looking Backward: Streaming Video-to-Video Translation with Feature Banks [65.46145157488344]
StreamV2V is a diffusion model that achieves real-time streaming video-to-video (V2V) translation guided by user prompts. It runs at 20 FPS on a single A100 GPU, 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively.
arXiv Detail & Related papers (2024-05-24T17:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site; its accuracy and completeness are not guaranteed.