StreamForest: Efficient Online Video Understanding with Persistent Event Memory
- URL: http://arxiv.org/abs/2509.24871v1
- Date: Mon, 29 Sep 2025 14:53:57 GMT
- Title: StreamForest: Efficient Online Video Understanding with Persistent Event Memory
- Authors: Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang,
- Abstract summary: StreamForest is designed for streaming video understanding.<n>Fine-grained Spatiotemporal Window captures detailed short-term visual cues to improve current scene perception.<n>OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction.
- Score: 37.73273040737155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
Related papers
- Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams [11.495597616926274]
Event-VStream represents continuous video as a sequence of discrete, semantically coherent events.<n>System detects meaningful state transitions by integrating motion, semantic, and predictive cues.<n>System maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
arXiv Detail & Related papers (2026-01-22T05:05:53Z) - HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams.<n>Hermes reuses a compact KV cache, enabling efficient streaming understanding under resource constraints.<n>Hermes achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - StreamingVLM: Real-Time Understanding for Infinite Video Streams [23.94087606884915]
StreamingVLM is a model designed for real-time, stable understanding of infinite visual input.<n>Our approach is a unified framework that aligns training with streaming inference.<n>On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
arXiv Detail & Related papers (2025-10-10T17:59:58Z) - Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time.<n>Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z) - TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos [47.91239059703758]
TimeChat-Online is a novel online VideoLLM that revolutionizes real-time video interaction.<n>Our Differential Token Drop (DTD) module addresses the challenge of visual redundancy in streaming videos.<n>Our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench.
arXiv Detail & Related papers (2025-04-24T07:59:46Z) - Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes.<n>Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements.<n>We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing.<n>Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z) - VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction.<n>VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges [39.666361965650836]
VideoLLaMB is a framework for long video understanding.<n> SceneTiling algorithm segments videos into coherent semantic units.<n>VideoLLaMB processes up to 320 frames using a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z) - Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams [78.72965584414368]
We present Flash-VStream, a video-language model that simulates the memory mechanism of human.
Compared to existing models, Flash-VStream achieves significant reductions in latency inference and VRAM consumption.
We propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding.
arXiv Detail & Related papers (2024-06-12T11:07:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.