Fugu-MT Paper Translation (Abstract): Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

Paper abstract: Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

arxiv url: http://arxiv.org/abs/2606.06991v1
Date: Fri, 05 Jun 2026 07:29:20 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.614185
Title: Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding
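Title (reference translation): Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding
Authors: Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu
Abstract summary: We introduce a new paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS). LyraV is a live streaming assistant built on a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC) makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content.
Reference score (independently computed attention score): 69.296913137409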
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving a fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering a 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
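Abstract (reference translation): Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. Existing models typically pause video perception while generating responses, which breaks real-time video-language synchrony and causes stutters. We introduce Streaming Video-Language Synchrony (SVLS), a new paradigm for online video understanding, and LyraV, a live streaming assistant built on a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free, verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components let LyraV seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments on five online and three offline benchmarks show that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, reaching 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, LyraV shows an empirical capability for dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.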
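
The per-frame, sub-budget decoding idea in the abstract can be illustrated with a toy Python sketch. This is not the authors' implementation: the names `Action`, `interleave`, `decide`, `plan_response`, and `tokens_per_frame` are hypothetical, the three-way decision is only a stand-in for what the abstract attributes to FDTC, and the fixed budget is only a stand-in for the rate that SToP is said to predict. Frames are plain strings and "decoding" is a canned word list, so the example shows only the interleaving pattern.

```python
from enum import Enum
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


class Action(Enum):
    """Hypothetical three-way decision, mirroring the choices named in the abstract."""
    SILENT = "silent"        # emit nothing for this frame
    CONTINUE = "continue"    # keep extending the current response
    START_NEW = "start_new"  # abandon the pending response and begin a new one


def interleave(
    frames: Iterable[str],
    decide: Callable[[Optional[str], str, bool], Action],
    plan_response: Callable[[str], List[str]],
    tokens_per_frame: Callable[[str], int],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (frame, tokens) pairs, emitting at most the per-frame budget each step."""
    pending: List[str] = []          # tokens planned but not yet emitted
    prev: Optional[str] = None
    for frame in frames:
        action = decide(prev, frame, bool(pending))
        if action is Action.START_NEW:
            pending = list(plan_response(frame))
        elif action is Action.SILENT:
            pending = []
        # Assumption: the budget is a token count that fits one frame interval.
        budget = max(0, tokens_per_frame(frame))
        chunk, pending = pending[:budget], pending[budget:]
        yield frame, chunk           # the next frame is read right after this chunk
        prev = frame


def demo_decide(prev: Optional[str], frame: str, has_pending: bool) -> Action:
    # Toy rule: a changed frame starts a new response; otherwise finish or stay quiet.
    if frame != prev:
        return Action.START_NEW
    return Action.CONTINUE if has_pending else Action.SILENT


def demo_plan(frame: str) -> List[str]:
    # Stand-in for a language model: a canned four-token description.
    return ["a", frame, "appears", "now"]


result = list(interleave(["cat", "cat", "cat", "dog"], demo_decide, demo_plan, lambda f: 2))
for frame, tokens in result:
    print(frame, tokens)
```

With a budget of two tokens per frame, the four-token description of the first frame is spread over two frames instead of blocking the stream until the sentence is complete, which is the behavior the abstract describes at a high level.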

Related papers