Fugu-MT Paper Translation (Abstract): Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Paper overview: Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

arxiv url: http://arxiv.org/abs/2603.12262v1
Date: Thu, 12 Mar 2026 17:59:51 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.295843
Title: Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Title (reference translation): Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
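Abstract summary: Video Streaming Thinking (VST) is a new paradigm for video understanding. It supports a thinking-while-watching mechanism that activates reasoning over incoming video clips during streaming. VST improves timely comprehension and coherent cognition while preserving real-time responsiveness.
Reference score (independently computed attention score): 69.0264594684213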
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
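Abstract (reference translation): Online Video Large Language Models (VideoLLMs) play an important role in supporting responsive, real-time interaction. Existing methods focus on streaming perception and lack a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a new paradigm for streaming video understanding. It supports a thinking-while-watching mechanism that activates reasoning over incoming video clips during streaming. By amortizing LLM reasoning latency over video playback, this design improves timely comprehension and coherent cognition while preserving real-time responsiveness. Furthermore, we introduce a post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. In addition, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought that enforces multi-evidence reasoning and sustained attention to the video stream. In extensive evaluations, VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form and reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.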
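
Illustrative note: the abstract's claim that reasoning latency is amortized over video playback can be made concrete with a toy timing model. The sketch below is not the VST implementation. The function names, the fixed per-clip reasoning time, and all parameter values are hypothetical and chosen only for illustration. It compares answering after deferring all reasoning until the question arrives with reasoning over each clip while later clips are still playing.

```python
def deferred_latency(num_clips: int, think_s: float, answer_s: float) -> float:
    """Response latency when all per-clip reasoning starts only after the question arrives."""
    return num_clips * think_s + answer_s


def streaming_latency(num_clips: int, clip_s: float, think_s: float, answer_s: float) -> float:
    """Response latency when reasoning over each clip overlaps with playback of later clips.

    Assumes (hypothetically) one reasoning worker, a fixed reasoning time per clip,
    and a question that arrives right after the last clip finishes playing.
    """
    finish = 0.0  # time at which the reasoning worker becomes free
    for i in range(num_clips):
        arrival = (i + 1) * clip_s    # clip i has been fully received at this time
        start = max(arrival, finish)  # wait for the clip and for the worker
        finish = start + think_s
    question_time = num_clips * clip_s
    backlog = max(0.0, finish - question_time)  # reasoning still pending at question time
    return backlog + answer_s
```

With ten 2-second clips, 1.5 seconds of reasoning per clip, and 0.5 seconds to produce the answer, the deferred variant takes 15.5 seconds to respond, while the streaming variant takes 2.0 seconds, because only the last clip's reasoning is still pending when the question arrives. These numbers come from the toy model only and are unrelated to the speedups reported in the paper.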
