Fugu-MT Paper Translation (Abstract): Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

Paper overview: Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

arxiv url: http://arxiv.org/abs/2604.18459v1
Date: Mon, 20 Apr 2026 16:15:33 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.985818
Title: Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
Title (reference translation): Progressive Online Video Understanding Using Evidence-Aligned Timing and Transparent Decisions
Authors: Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge, Changlin Li, Junhan Zhao, Zhihui Li, Xiaojun Chang
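Abstract summary: Thinking-QwenVL instantiates a framework that decouples reasoning control from memory integration. The Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process. The Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system.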
Reference score (independently computed attention score): 75.23170605943457
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Our approach sets a new standard on key online video understanding benchmarks, achieving strong performance of 71.6% on StreamingBench and 46.9% on OVOBench, demonstrating a robust solution for evidence-aligned and transparent online video analysis. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.
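Abstract (reference translation): Visual agents operating in the wild need to answer queries at exactly the moment sufficient evidence first appears in a video stream, an important capability that conventional video LLMs evaluated in offline settings overlook. Moving to an online streaming paradigm brings major challenges: a lack of transparency in decision making, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under a strict computational budget. To address these problems, we propose a new framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This lets it time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module serves as an efficient memory system. A set of learnable, multi-level aggregation tokens is propagated across clips to build a rich, global cognitive state without exceeding the token budget. Our approach sets a new standard on major online video understanding benchmarks, reaching 71.6% on StreamingBench and 46.9% on OVOBench. Extensive experiments demonstrate the effectiveness of ATDM and HPSI; for example, Thinking-QwenVL improves on the previous state-of-the-art accuracy on the StreamingBench benchmark from 67.63% to 71.60%.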
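
The timing idea in the abstract (respond at $t_r$ as close as possible to the first-sufficient-evidence timestamp $t^\star$, based on observable progress and confidence) can be illustrated with a toy sketch. This is not the paper's ATDM: the `ClipSignal` type, the threshold rule, and the values `rho_min` and `c_min` are all assumptions made up for illustration, and the abstract does not say how progress and confidence are actually combined.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClipSignal:
    """Per-clip signals; every field is a hypothetical stand-in."""
    t: float    # clip end timestamp, in seconds
    rho: float  # progress estimate in [0, 1]
    c: float    # confidence estimate in [0, 1]


def first_response_time(
    stream: Iterable[ClipSignal],
    rho_min: float = 0.8,  # assumed threshold, not taken from the paper
    c_min: float = 0.7,    # assumed threshold, not taken from the paper
) -> Optional[float]:
    """Return t_r: the first timestamp at which both signals clear their thresholds."""
    for clip in stream:
        if clip.rho >= rho_min and clip.c >= c_min:
            return clip.t
    return None  # the stream ended without enough evidence


def timing_error(t_r: Optional[float], t_star: float) -> Optional[float]:
    """Signed gap t_r - t_star; a positive value means the response came late."""
    return None if t_r is None else t_r - t_star


# Toy stream: the evidence becomes sufficient at the third clip.
signals = [
    ClipSignal(t=2.0, rho=0.2, c=0.9),
    ClipSignal(t=4.0, rho=0.6, c=0.5),
    ClipSignal(t=6.0, rho=0.9, c=0.8),
    ClipSignal(t=8.0, rho=1.0, c=0.95),
]
t_r = first_response_time(signals)
print(t_r, timing_error(t_r, t_star=6.0))  # prints: 6.0 0.0
```

Requiring both signals to pass is only one possible design choice; it keeps a confident but premature guess (the first clip in the toy stream) from triggering a response before the evidence is complete.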

Related papers