Fugu-MT Paper Translation (Abstract): STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Paper overview: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

arxiv url: http://arxiv.org/abs/2603.27593v1
Date: Sun, 29 Mar 2026 09:23:45 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.035905
Title: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
Title (reference translation): STRIDE: The Topic of Sequence Denoising for Streaming Video Understanding
Authors: Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
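Abstract summary: Real-world deployments increasingly require streaming perception and proactive interaction. This work revisits proactive activation in streaming video as a structured sequence modeling problem. It proposes STRIDE, which uses a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals.
Reference score (independently calculated attention level): 77.20037111885226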
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
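Abstract (reference translation): Recent advances in video large language models (Video-LLMs) have enabled strong offline reasoning over long and complex videos. Real-world deployments, however, increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond but also when to respond. This work revisits proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, activation signals are modeled jointly over a sliding temporal window and updated iteratively as new frames arrive. The paper proposes STRIDE (Structured Temporal Refinement with Iterative DEnoising), which uses a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models show that STRIDE produces more reliable and temporally coherent proactive responses, significantly improving the quality of when-to-speak decisions in online streaming scenarios.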
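
Illustrative sketch (not from the paper): the abstract describes predicting activation signals jointly over a sliding window and refining them over several denoising steps. The toy Python below shows one way such a loop could look, under loud assumptions: `toy_predictor`, `denoise_window`, `stream_activations`, the `neighbour_weight` blend, and the confidence-ordered unmasking schedule are all hypothetical stand-ins, and the per-frame scores are assumed to come from some upstream model. It also restarts denoising from a fully masked window for every new frame, which is a simplification of the iterative update the abstract mentions.

```python
import math
from collections import deque
from typing import Iterator, List, Optional, Sequence, Tuple

MASK = None  # marks a window slot whose activation is still undecided


def toy_predictor(scores: Sequence[float],
                  states: Sequence[Optional[int]],
                  neighbour_weight: float = 0.25) -> List[float]:
    """Hypothetical stand-in for a learned denoiser (an assumption).

    Blends each frame's own score with the mean of its already
    committed neighbours, so decisions tend to form contiguous spans.
    """
    probs = []
    for i, score in enumerate(scores):
        neighbours = [states[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(states) and states[j] is not MASK]
        if neighbours:
            context = sum(neighbours) / len(neighbours)
            probs.append((1 - neighbour_weight) * score
                         + neighbour_weight * context)
        else:
            probs.append(score)
    return probs


def denoise_window(scores: Sequence[float], num_steps: int = 3) -> List[int]:
    """Start fully masked; commit the most confident slots at each step."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    states: List[Optional[int]] = [MASK] * len(scores)
    for step in range(num_steps):
        masked = [i for i, s in enumerate(states) if s is MASK]
        if not masked:
            break
        probs = toy_predictor(scores, states)
        # Spread the remaining slots evenly over the remaining steps,
        # so the final step always commits everything that is left.
        k = math.ceil(len(masked) / (num_steps - step))
        masked.sort(key=lambda i: abs(probs[i] - 0.5), reverse=True)
        for i in masked[:k]:
            states[i] = 1 if probs[i] >= 0.5 else 0
    return [int(s) for s in states]


def stream_activations(scores: Sequence[float], window: int = 3,
                       num_steps: int = 2) -> Iterator[Tuple[int, bool]]:
    """Yield (frame index, speak now?) as each frame score arrives."""
    buf: deque = deque(maxlen=window)
    for t, score in enumerate(scores):
        buf.append(score)
        states = denoise_window(list(buf), num_steps=num_steps)
        yield t, states[-1] == 1


if __name__ == "__main__":
    for t, speak in stream_activations([0.25, 0.875, 0.4375, 0.75]):
        print(t, speak)
```

In this toy setup, a weak frame score (0.4375) that sits next to confidently active frames is still marked active, whereas thresholding each frame independently at 0.5 would drop it. That is the span-level coherence the abstract motivates, shown here only with a hand-written blend in place of a trained module.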

Related papers