Fugu-MT Paper Translation (Abstract): StreamingVLM: Real-Time Understanding for Infinite Video Streams

Paper overview: StreamingVLM: Real-Time Understanding for Infinite Video Streams

arxiv url: http://arxiv.org/abs/2510.09608v1
Date: Fri, 10 Oct 2025 17:59:58 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:49.58714
Title: StreamingVLM: Real-Time Understanding for Infinite Video Streams
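Title (reference translation): StreamingVLM: Real-Time Understanding for Infinite Video Streams
Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
Abstract summary: StreamingVLM is a model designed for real-time, stable understanding of infinite visual input. The approach is a unified framework that aligns training with streaming inference. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100.
Reference score (independently computed attention metric): 23.94087606884915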
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
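Abstract (reference translation): Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing an entire video with full attention incurs quadratic computational cost and performs poorly on long videos. Simple sliding-window methods are also flawed, because they either break coherence or suffer high latency from redundant recomputation. This paper introduces StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. The approach is a unified framework that aligns training with streaming inference. During inference, a compact KV cache is maintained by reusing the states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled through a simple supervised fine-tuning (SFT) strategy that applies full attention to short, overlapped video chunks, which mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, the authors build Inf-Streams-Eval, a new benchmark of videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. The SFT strategy also improves general VQA ability without any VQA-specific fine-tuning, raising performance on LongVideoBench by +4.30 and on OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.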
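
The cache policy described in the abstract (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name prune_cache, the parameters num_sinks, vision_window, and text_window, and the (kind, token_id) entry format are all hypothetical, and the sketch operates on token labels rather than real key/value tensors.

```python
from typing import List, Tuple

# Hypothetical cache entry: (kind, token_id), where kind is "vision" or "text".
Entry = Tuple[str, int]


def prune_cache(cache: List[Entry], num_sinks: int,
                vision_window: int, text_window: int) -> List[Entry]:
    """Keep the first `num_sinks` entries (the attention sinks), plus the
    most recent `vision_window` vision entries and the most recent
    `text_window` text entries from the remainder of the cache."""
    sinks = cache[:num_sinks]
    rest = cache[num_sinks:]

    # Walk backwards so the newest entries of each kind use the budget first.
    keep = [False] * len(rest)
    budget = {"vision": vision_window, "text": text_window}
    for i in range(len(rest) - 1, -1, -1):
        kind = rest[i][0]
        if budget[kind] > 0:
            keep[i] = True
            budget[kind] -= 1

    # Surviving entries stay in their original temporal order.
    return sinks + [entry for entry, kept in zip(rest, keep) if kept]


# Usage: two sink tokens, then five steps of 3 vision + 2 text tokens each.
cache: List[Entry] = [("text", 0), ("text", 1)]
next_id = 2
for _ in range(5):
    for kind, count in (("vision", 3), ("text", 2)):
        for _ in range(count):
            cache.append((kind, next_id))
            next_id += 1

pruned = prune_cache(cache, num_sinks=2, vision_window=3, text_window=6)
print(len(cache), "->", len(pruned))
```

With a small vision window and a larger text window, the pruned cache size is bounded by num_sinks + vision_window + text_window regardless of how long the stream runs, which is the property that keeps memory and latency from growing with video length.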

Related papers