Fugu-MT Paper Translation (Abstract): Harnessing Streaming Video in the Wild

Paper abstract: Harnessing Streaming Video in the Wild

arxiv url: http://arxiv.org/abs/2606.08615v1
Date: Sun, 07 Jun 2026 13:00:19 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.300667
Title: Harnessing Streaming Video in the Wild
Authors: Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang
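Abstract summary: Vision-Language Models (VLMs) are increasingly required to process video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-term memory, and real-time processing. Existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment.
Reference score (attention score computed independently by the site): 53.23721420272668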
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.


Related papers