Fugu-MT paper translation (abstract): StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Paper overview: StreamForest: Efficient Online Video Understanding with Persistent Event Memory

arxiv url: http://arxiv.org/abs/2509.24871v1
Date: Mon, 29 Sep 2025 14:53:57 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:20.060746
Title: StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Title (reference translation): StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Authors: Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang
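Abstract summary: StreamForest is designed for streaming video understanding. Its Fine-grained Spatiotemporal Window captures detailed short-term visual cues to improve perception of the current scene. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction.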
Reference score (independently computed attention score): 37.73273040737155
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
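Abstract (reference translation): Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding in recent years. However, their effectiveness in real-time streaming scenarios remains limited because of storage constraints on historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, the authors propose StreamForest, a new architecture designed specifically for streaming video understanding. At the center of StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, which enables efficient long-term memory retention under limited computational resources. To enhance real-time perception, they introduce a Fine-grained Spatiotemporal Window that captures detailed short-term visual cues and improves perception of the current scene. They also present OnlineIT, an instruction-tuning dataset suited to streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, they introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results show that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. Notably, even under extreme visual token compression (limited to 1024 tokens), it retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.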
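
The abstract says that frames are merged into event-level trees under penalty functions based on temporal distance, content similarity, and merge frequency, but it gives no formulas. The following is only a minimal sketch of that general idea, not the authors' method: the `Node` fields, the `merge_penalty` formula, its weights, and the adjacent-pair merge rule are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    feature: list          # pooled visual feature of the frames under this node
    time: float            # mean timestamp of those frames
    merges: int = 0        # depth of merging that produced this node
    frames: int = 1        # number of original frames it summarizes
    children: tuple = ()   # sub-trees; empty for a single frame


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_penalty(left, right, now, w_sim=1.0, w_time=1.0, w_freq=0.5):
    """Hypothetical penalty: a low value means the pair is cheap to merge."""
    dissimilarity = 1.0 - cosine(left.feature, right.feature)  # content term
    recency = 1.0 / (1.0 + now - max(left.time, right.time))   # protects recent frames
    frequency = left.merges + right.merges                     # discourages re-merging
    return w_sim * dissimilarity + w_time * recency + w_freq * frequency


def merge(left, right):
    """Combine two neighbouring trees into one root with a frame-weighted feature."""
    total = left.frames + right.frames
    feature = [(x * left.frames + y * right.frames) / total
               for x, y in zip(left.feature, right.feature)]
    time = (left.time * left.frames + right.time * right.frames) / total
    return Node(feature, time, max(left.merges, right.merges) + 1, total, (left, right))


def add_frame(memory, feature, time, budget):
    """Append a frame, then merge the cheapest adjacent roots until within budget."""
    memory.append(Node(list(feature), time))
    while len(memory) > budget:
        i = min(range(len(memory) - 1),
                key=lambda k: merge_penalty(memory[k], memory[k + 1], now=time))
        memory[i:i + 2] = [merge(memory[i], memory[i + 1])]
    return memory
```

Here `memory` is a list of tree roots ordered by time, and `budget` stands in for the visual token limit mentioned in the abstract. With these assumed weights, old and visually similar neighbours are merged first, while recent frames stay at full detail.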

Related papers