Fugu-MT Paper Translation (Summary): StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

Paper summary: StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

arxiv url: http://arxiv.org/abs/2605.25621v1
Date: Mon, 25 May 2026 09:23:19 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:19.547124
Title: StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering
Title (reference translation): StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering
Authors: Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu
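Abstract summary: StreamOV is a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. It uses a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. It achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.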
Reference score (independently computed attention score): 39.92453666681465
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.
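Abstract (reference translation): Streaming omni-video understanding requires continuous perception and proactive, real-time interaction, yet this important area remains largely unexplored. Current omni-modal methods are inherently designed for offline settings, and two fundamental flaws limit their applicability to streaming scenarios. First, they lack robust mechanisms for managing audio-visual context that grows continuously over long horizons, and they cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are mostly limited to offline, single-turn question answering and fail to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact, informative evidence under a fixed budget. It also employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.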
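
Illustrative sketch (not from the paper): the abstract names two mechanisms, a memory that stays within a fixed budget and a trigger driven by the model's hidden state. The Python below is only a toy illustration of those two ideas under stated assumptions. The class and function names (`Evidence`, `BoundedEvidenceMemory`, `should_respond`), the per-chunk relevance score, the top-k retention rule, and the linear probe with a sigmoid threshold are all hypothetical choices made here; the abstract does not specify how StreamOV scores evidence or computes its trigger.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass(order=True)
class Evidence:
    # Hypothetical relevance score; how StreamOV scores evidence is not stated.
    score: float
    timestamp: int = field(compare=False)
    payload: str = field(compare=False)


class BoundedEvidenceMemory:
    """Toy long/short memory: recent chunks are kept verbatim, older chunks
    compete for a fixed number of long-term slots by score (an assumption)."""

    def __init__(self, short_size: int, long_budget: int):
        self.short_size = short_size
        self.long_budget = long_budget
        self.short = []  # recent chunks, oldest first
        self.long = []   # min-heap ordered by score

    def add(self, item: Evidence) -> None:
        self.short.append(item)
        if len(self.short) > self.short_size:
            # The oldest recent chunk moves to long-term memory.
            heapq.heappush(self.long, self.short.pop(0))
            if len(self.long) > self.long_budget:
                # Over budget: drop the lowest-scoring long-term entry.
                heapq.heappop(self.long)

    def context(self):
        # Retained evidence in chronological order.
        return sorted(self.long + self.short, key=lambda e: e.timestamp)

    def __len__(self):
        return len(self.short) + len(self.long)


def should_respond(hidden_state, probe_weights, bias=0.0, threshold=0.5):
    """Assumed trigger: a linear probe on the hidden state, squashed by a
    sigmoid and compared to a threshold. No silence token is generated."""
    logit = sum(h * w for h, w in zip(hidden_state, probe_weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob >= threshold


memory = BoundedEvidenceMemory(short_size=2, long_budget=2)
for t, s in enumerate([0.1, 0.9, 0.2, 0.8, 0.3, 0.4]):
    memory.add(Evidence(score=s, timestamp=t, payload=f"chunk{t}"))
kept = [e.timestamp for e in memory.context()]
```

In this toy run the memory never holds more than `short_size + long_budget` items regardless of stream length, which is the bounded-memory property the abstract describes. The trigger is a single cheap function of the current hidden state, so deciding whether to speak needs no extra decoding step and no separate router model.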

Related papers