Fugu-MT Paper Translation (Summary): VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

Paper summary: VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

arxiv url: http://arxiv.org/abs/2604.07634v1
Date: Wed, 08 Apr 2026 22:31:20 GMT
Status: translation complete
Last updated in system: 2026-04-10 18:34:05.585813
Title: VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models
Authors: Pavan Kumar Anasosalu Vasu, Cem Koc, Fartash Faghri, Chun-Liang Li, Bo Feng, Zhengfeng Lai, Meng Cao, Oncel Tuzel, Hadi Pouransari
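Abstract summary: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. We propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants.
Reference score (independently computed attention score): 39.78979236902648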
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.
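The abstract names synchronous and asynchronous evaluation protocols and a memory buffer length but does not define them. The sketch below is one plausible reading, not the paper's definition: in the synchronous case the stream is assumed to pause while the model runs, so every frame gets a response; in the asynchronous case frames are assumed to arrive on a real-time clock, so a frame that arrives while the model is busy is stored but triggers no response. The function `simulate` and all of its parameters are hypothetical names introduced for illustration.

```python
from collections import deque


def simulate(num_frames, frame_interval_ms, latency_ms, buffer_len, asynchronous):
    """Toy streaming loop (illustrative assumption, not the VSAS-Bench code).

    Returns a list of (frame_index, context) pairs, one per model response,
    where context is the tuple of frame indices held in the memory buffer.
    """
    buffer = deque(maxlen=buffer_len)  # bounded memory: oldest frames are evicted
    answered = []
    busy_until_ms = 0  # time at which the model finishes its current call

    for i in range(num_frames):
        arrival_ms = i * frame_interval_ms
        buffer.append(i)  # every frame enters memory, answered or not

        if asynchronous and arrival_ms < busy_until_ms:
            # Assumed asynchronous behavior: the model is still busy,
            # so this frame does not trigger a response.
            continue

        if asynchronous:
            busy_until_ms = arrival_ms + latency_ms

        answered.append((i, tuple(buffer)))

    return answered


if __name__ == "__main__":
    # 6 frames at 10 fps, 250 ms per model call, memory of 3 frames.
    print(simulate(6, 100, 250, 3, asynchronous=False))
    print(simulate(6, 100, 250, 3, asynchronous=True))
```

Under these assumptions, a model whose latency exceeds the frame interval answers fewer frames in the asynchronous run, which is one way an accuracy-latency trade-off of the kind the abstract mentions could show up.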

