Fugu-MT Paper Translation (Abstract): LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Paper overview: LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

arxiv url: http://arxiv.org/abs/2606.17798v1
Date: Tue, 16 Jun 2026 11:18:05 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.397669
Title: LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams
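Title (reference translation): LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams
Authors: Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu
Abstract summary: LiveStarPro is a live streaming assistant designed for proactive video understanding over long-horizon streams. It consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness. Its streaming key-value cache yields a 1.58x inference speedup over the same model without caching.
Reference score (attention score computed independently by the site): 59.485485426790966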
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
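Abstract (reference translation): Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon memory. These obstacles undermine real-time responsiveness and cause severe forgetting over prolonged interactions. This work introduces LiveStarPro, a live streaming assistant for proactive video understanding over long-horizon streams. Its design rests on three complementary components. The first, Streaming Verification Decoding (SVeD), is an inference framework that identifies the appropriate response timing through single-pass perplexity verification. The second is Streaming Causal Attention Masks (SCAM). The third, Tree-Structured Hierarchical Memory (TSHM), is a recursive memory architecture that organizes past information into event chains. To facilitate comprehensive evaluation under realistic online conditions, the authors present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and extends to hour-scale streams for assessing long-term recall. Extensive experiments show that LiveStarPro consistently surpasses existing methods, with a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error. The model and code are available at https://github.com/sotayang/LiveStarPro.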
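
The abstract says SVeD picks the response timing through single-pass perplexity verification but gives no formula or decision rule. The sketch below only illustrates the general idea of gating a response on perplexity; it is not the authors' implementation. The function names, the threshold value, and the direction of the rule (respond when perplexity is low) are all assumptions made for illustration. It further assumes that one forward pass has already produced natural-log probabilities for each token of a candidate response.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def should_respond(token_logprobs: Sequence[float], threshold: float = 2.0) -> bool:
    # Hypothetical rule (not taken from the paper): respond only when the
    # candidate response is confidently supported, i.e. its perplexity
    # does not exceed the threshold.
    return perplexity(token_logprobs) <= threshold


if __name__ == "__main__":
    confident = [math.log(0.9)] * 3   # high-probability tokens
    uncertain = [math.log(0.1)] * 3   # low-probability tokens
    print(should_respond(confident), should_respond(uncertain))
```

Scoring a candidate in a single pass means the decision needs no dedicated silence token, which is the property the abstract highlights. The threshold would have to be tuned on held-out streams.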

Related papers