Fugu-MT Paper Translation (Abstract): PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Paper abstract: PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

arxiv url: http://arxiv.org/abs/2606.02443v1
Date: Mon, 01 Jun 2026 16:14:01 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:32.494528
Title: PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
Title (reference translation): PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
Authors: Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He
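Abstract summary: Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings in the window between the first visible sign of danger and an accident. PaSBench-Video is a 740-video benchmark with 481 risk and 259 no-risk videos across four domains. No model exceeds 20.0% on the strictest metric, and recall is tightly coupled with false-positive rate.
Reference score (independently calculated attention score): 54.70033525672316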
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
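Abstract (reference translation): Between the first visible sign of danger and the moment an accident occurs, there is often a window in which intervention is still possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Current benchmarks, however, rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. PaSBench-Video is a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs shows that no model exceeds 20.0% on the strictest metric and that recall is tightly coupled with false-positive rate. Models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.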
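
The evaluation idea in the abstract (a warning must fall between risk onset and the accident, and false alarms on safe clips are counted separately) can be illustrated with a minimal sketch. The abstract does not give the benchmark's metric definitions, so everything here is an assumption for illustration: the `Clip` fields, the rule that a warning is timely when its frame lies in [onset, accident), and the omission of any content-correctness check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clip:
    # Hypothetical fields; the benchmark's real annotation schema is not stated.
    onset_frame: Optional[int]     # first visible sign of danger (None for no-risk clips)
    accident_frame: Optional[int]  # frame where the accident occurs (None for no-risk clips)
    warning_frame: Optional[int]   # frame where the model warned (None if it stayed silent)

    @property
    def is_risk(self) -> bool:
        return self.onset_frame is not None


def timely(clip: Clip) -> bool:
    """Assumed rule: a warning counts only if it lands in [onset, accident)."""
    if not clip.is_risk or clip.warning_frame is None:
        return False
    return clip.onset_frame <= clip.warning_frame < clip.accident_frame


def recall_and_fpr(clips):
    """Recall over risk clips and false-positive rate over no-risk clips."""
    risk = [c for c in clips if c.is_risk]
    safe = [c for c in clips if not c.is_risk]
    recall = sum(timely(c) for c in risk) / len(risk) if risk else 0.0
    fpr = sum(c.warning_frame is not None for c in safe) / len(safe) if safe else 0.0
    return recall, fpr


clips = [
    Clip(100, 200, 150),     # warned inside the window
    Clip(100, 200, 50),      # warned before any visible risk
    Clip(100, 200, None),    # never warned
    Clip(100, 200, 199),     # warned just before the accident
    Clip(None, None, 30),    # false alarm on a safe clip
    Clip(None, None, None),  # correctly silent on a safe clip
]
recall, fpr = recall_and_fpr(clips)
print(f"recall={recall:.2f} fpr={fpr:.2f}")
```

Under these assumptions, a model that warns on every clip from the first frame would score zero recall (its warnings precede onset) while driving the false-positive rate to 1, which is why timing and safe-scene behavior have to be scored together.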

Related papers