Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
- URL: http://arxiv.org/abs/2601.15655v1
- Date: Thu, 22 Jan 2026 05:05:53 GMT
- Title: Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
- Authors: Zhenghui Guo, Yuanbin Man, Junyuan Sheng, Bowen Lin, Ahmed Ahmed, Bo Jiang, Boyuan Zhang, Miao Yin, Sian Jin, Omprakash Gnawal, Chengming Zhang
- Abstract summary: Event-VStream represents continuous video as a sequence of discrete, semantically coherent events. The system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and maintains around a 70% GPT-5 win rate on 2-hour Ego4D streams.
- Score: 11.495597616926274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time understanding of long video streams remains challenging for multimodal large language models (MLLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around a 70% GPT-5 win rate on 2-hour Ego4D streams.
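The abstract describes an event-triggered decode-and-consolidate loop. Below is a minimal sketch of that idea, assuming cosine-similarity cues, a fixed fusion weighting, and mean-pooled event embeddings; the paper's actual cue functions, trigger threshold, and consolidation scheme are not given here, so these stand-ins are illustrative only.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class EventVStreamSketch:
    """Toy event-triggered streaming loop: fuse motion, semantic, and
    predictive cues; decode only at detected event boundaries."""

    def __init__(self, threshold=0.5, weights=(0.4, 0.4, 0.2)):
        self.threshold = threshold          # trigger threshold (assumed value)
        self.w_motion, self.w_sem, self.w_pred = weights
        self.prev_emb = None                # last frame embedding
        self.event_frames = []              # frames of the current event
        self.memory_bank = []               # persistent per-event embeddings

    def step(self, frame_emb, predicted_emb):
        """frame_emb: current frame embedding; predicted_emb: what a
        lightweight predictor expected this frame to look like."""
        self.event_frames.append(frame_emb)
        if self.prev_emb is None:
            self.prev_emb = frame_emb
            return None
        motion_cue = 1.0 - cosine(frame_emb, self.prev_emb)        # frame-to-frame change
        semantic_cue = 1.0 - cosine(frame_emb, np.mean(self.event_frames, axis=0))
        predictive_cue = 1.0 - cosine(frame_emb, predicted_emb)    # surprise vs. prediction
        score = (self.w_motion * motion_cue + self.w_sem * semantic_cue
                 + self.w_pred * predictive_cue)
        self.prev_emb = frame_emb
        if score > self.threshold:          # state transition -> event boundary
            event_emb = np.mean(self.event_frames, axis=0)
            self.memory_bank.append(event_emb)   # consolidate into memory
            self.event_frames = [frame_emb]
            return event_emb                # caller invokes the LLM only now
        return None
```

With this gating, the language model is invoked once per event rather than once per frame, which is where the latency and redundancy savings would come from.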
Related papers
- HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams. It reuses a compact KV cache, enabling efficient streaming understanding under resource constraints, and achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z)
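The summary does not spell out HERMES's cache policy; the sketch below only illustrates the general notion of treating the KV cache as tiered memory, with the two-tier split and the average-and-demote eviction rule being assumptions of mine.

```python
from collections import deque

class HierarchicalKVCache:
    """Toy two-tier KV cache: a short, full-resolution recent tier and a
    compact long-term tier built by averaging evicted entries."""

    def __init__(self, recent_size=64, longterm_size=256, merge=4):
        self.recent = deque()                 # (key, value) pairs, full detail
        self.longterm = deque(maxlen=longterm_size)  # oldest slots drop off
        self.recent_size = recent_size
        self.merge = merge                    # entries merged per long-term slot

    def append(self, key, value):
        self.recent.append((key, value))
        if len(self.recent) > self.recent_size:
            # Demote the oldest `merge` entries as one averaged slot.
            old = [self.recent.popleft() for _ in range(self.merge)]
            k = sum(k for k, _ in old) / len(old)
            v = sum(v for _, v in old) / len(old)
            self.longterm.append((k, v))

    def context(self):
        """KV entries visible to the next decoding step: compact history
        first, then the recent window at full resolution."""
        return list(self.longterm) + list(self.recent)
```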
- video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
video-SALMONN S is the first model to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. A test-time-training memory module continually updates token representations to capture long-range dependencies, and a prompt-dependent memory reader retrieves context-relevant content from the fixed-size memory.
arXiv Detail & Related papers (2025-10-13T08:20:15Z)
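As a rough sketch of what a prompt-dependent memory reader can look like, the function below scores a fixed-size memory against the prompt embedding and returns the top-k slots; the actual module is learned and updated by test-time training, which is omitted here.

```python
import numpy as np

def read_memory(prompt_emb, memory, k=8):
    """Retrieve the k memory slots most relevant to the prompt.

    prompt_emb: (d,) embedding of the user query.
    memory:     (n, d) fixed-size bank of token representations.
    """
    memory = np.asarray(memory)
    # Cosine similarity between the prompt and every memory slot.
    sims = memory @ prompt_emb / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
    top = np.argsort(sims)[-k:][::-1]   # indices of the best-matching slots
    return memory[top], sims[top]
```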
- StreamingVLM: Real-Time Understanding for Infinite Video Streams [23.94087606884915]
StreamingVLM is a model designed for real-time, stable understanding of infinite visual input. The approach is a unified framework that aligns training with streaming inference. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
arXiv Detail & Related papers (2025-10-10T17:59:58Z)
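The summary leaves the alignment mechanism unstated; one common recipe, assumed here purely for illustration, is to use the same sink-plus-sliding-window attention mask during training and during streaming inference, so the model never sees an attention pattern at test time that it was not trained on.

```python
import numpy as np

def streaming_attention_mask(seq_len, sink=4, window=256):
    """Boolean (query, key) mask: every query attends to a few global
    'sink' tokens plus a sliding window of recent tokens. Applying the
    same mask at train and inference time keeps the two regimes aligned."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                 # no attention to future tokens
    in_window = (q - k) < window    # recent context at full resolution
    is_sink = k < sink              # always-visible anchor tokens
    return causal & (in_window | is_sink)
```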
- StreamForest: Efficient Online Video Understanding with Persistent Event Memory [37.73273040737155]
StreamForest is designed for streaming video understanding. A Fine-grained Spatiotemporal Window captures detailed short-term visual cues to improve current-scene perception, while OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction.
arXiv Detail & Related papers (2025-09-29T14:53:57Z)
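A hedged sketch of the dense-recent-window-plus-persistent-event-memory pattern the summary names; the window length and mean-pooled event summaries are my assumptions, and StreamForest's actual merging strategy may differ.

```python
from collections import deque

class StreamForestContext:
    """Toy context assembler: a dense short-term window for the current
    scene plus a sparse persistent event memory for long-horizon recall."""

    def __init__(self, window=16):
        self.window = deque(maxlen=window)   # recent frame tokens, full rate
        self.events = []                     # compressed past-event tokens

    def add_frame(self, frame_tokens, is_event_boundary):
        self.window.append(frame_tokens)
        if is_event_boundary:
            # Compress the current window into one event summary (mean here).
            self.events.append(sum(self.window) / len(self.window))

    def build_context(self):
        # Long-term event tokens first, then the fine-grained recent window.
        return self.events + list(self.window)
```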
- StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling [27.468345201477504]
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. StreamVLN is a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multimodal reasoning over interleaved vision, language, and action inputs. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment.
arXiv Detail & Related papers (2025-07-07T17:49:41Z)
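The slow-fast split can be illustrated with two buffers, a dense recent window and a sparsely subsampled long-horizon track; the buffer sizes and stride below are illustrative assumptions, not StreamVLN's actual settings.

```python
from collections import deque

class SlowFastContext:
    """Toy slow-fast buffer for a streaming agent: a fast window of recent
    frames at full rate and a slow track of sparsely retained frames."""

    def __init__(self, fast_len=8, slow_stride=16, slow_len=32):
        self.fast = deque(maxlen=fast_len)     # dense, fresh context
        self.slow = deque(maxlen=slow_len)     # sparse, long-horizon context
        self.slow_stride = slow_stride
        self.t = 0

    def push(self, frame_feat):
        self.fast.append(frame_feat)
        if self.t % self.slow_stride == 0:     # keep every Nth frame long-term
            self.slow.append(frame_feat)
        self.t += 1

    def context(self):
        # Coarse history (slow) first, then fresh frames (fast).
        return list(self.slow) + list(self.fast)
```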
- Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time. Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z)
- VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. It employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z)
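A one-token-per-frame carrier can be approximated by pooling a frame's patch embeddings; VideoScan learns this compression, so the mean pooling below is only a placeholder for the learned module.

```python
import numpy as np

def semantic_carrier_token(patch_embs):
    """Collapse a frame's patch embeddings of shape (p, d) into a single
    carrier token of shape (d,). Mean pooling stands in for the learned
    compression used by the actual system."""
    return np.asarray(patch_embs).mean(axis=0)

# Streaming usage: one token per frame keeps context growth linear and tiny.
# stream_tokens = [semantic_carrier_token(f) for f in frame_patch_embeddings]
```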
- SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains. We design a semi-automated annotation pipeline to obtain 49,979 question-answer (QA) pairs from 1,353 streaming videos. Experimental results from 14 models in dialogue and streaming evaluations reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. The framework incorporates a parallel system-scheduling strategy that increases processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanisms of humans.
Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)