Fugu-MT Paper Translation (Summary): Thinking in Streaming Video

Paper summary: Thinking in Streaming Video

arxiv url: http://arxiv.org/abs/2603.12938v1
Date: Fri, 13 Mar 2026 12:33:36 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.0841
Title: Thinking in Streaming Video
Title (reference translation): Thinking About Streaming Video
Authors: Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, Jing Liu
Abstract summary: ThinkStream is a framework for streaming video reasoning based on the Watch-Think-Speak paradigm. Reasoning-Compressed Streaming Memory (RCSM) treats intermediate reasoning traces as compact semantic memory. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models.
Reference score (independently computed attention score): 30.61790766076081
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch-Think-Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
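Abstract (reference translation): Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context has been observed, which leads to high latency and growing computational cost that are incompatible with streaming scenarios. This paper introduces ThinkStream, a streaming video reasoning framework based on the Watch-Think-Speak paradigm. At each step, the model performs a short reasoning update and decides whether enough evidence has accumulated to produce a response. To support long-horizon streaming, the authors propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. The model is further trained with a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models, and data will be released at https://github.com/johncaged/ThinkStream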
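The abstract describes a per-step loop (watch a new observation, run a short reasoning update, decide whether to respond) together with a memory that keeps reasoning traces while dropping outdated visual tokens. The following is a minimal toy sketch of that control flow only, not the authors' implementation: `StreamingMemory`, `watch_think_speak`, `toy_think`, and the `max_visual_chunks` parameter are hypothetical names, strings stand in for visual tokens, and a keyword check stands in for the trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StreamingMemory:
    """Toy stand-in for a reasoning-compressed memory (hypothetical design).

    Recent visual chunks are kept verbatim; older ones are evicted, while the
    short reasoning note written for every chunk is retained as context.
    """
    max_visual_chunks: int = 2
    visual_chunks: List[str] = field(default_factory=list)
    reasoning_notes: List[str] = field(default_factory=list)

    def update(self, chunk: str, note: str) -> None:
        self.visual_chunks.append(chunk)
        self.reasoning_notes.append(note)
        # Evict the oldest visual chunks once the budget is exceeded.
        while len(self.visual_chunks) > self.max_visual_chunks:
            self.visual_chunks.pop(0)


# A "think" step returns (reasoning note, answer or None if not ready yet).
ThinkFn = Callable[[str, StreamingMemory, str], Tuple[str, Optional[str]]]


def watch_think_speak(
    stream: Iterable[str],
    think: ThinkFn,
    question: str,
    memory: StreamingMemory,
) -> Optional[Tuple[int, str]]:
    """Process chunks one at a time; return (step, answer) on the first response."""
    for step, chunk in enumerate(stream):          # Watch
        note, answer = think(chunk, memory, question)  # Think
        memory.update(chunk, note)
        if answer is not None:                     # Speak
            return step, answer
    return None  # Stream ended without enough evidence to respond.


def toy_think(chunk: str, memory: StreamingMemory, question: str) -> Tuple[str, Optional[str]]:
    """Assumed placeholder policy: respond once a chunk mentions a person."""
    note = f"saw: {chunk}"
    answer = chunk if "person" in chunk else None
    return note, answer


if __name__ == "__main__":
    demo_memory = StreamingMemory(max_visual_chunks=2)
    demo_stream = ["empty road", "car enters", "car stops", "person exits car"]
    print(watch_think_speak(demo_stream, toy_think, "Who exits the car?", demo_memory))
    print(demo_memory.visual_chunks, demo_memory.reasoning_notes)
```

In this sketch the memory footprint of raw observations is bounded by `max_visual_chunks`, while the notes grow by one short string per step; how the actual system compresses, stores, or bounds its reasoning traces is not specified in the abstract.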


Related papers