Fugu-MT Paper Translation (Abstract): CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Paper overview: CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

arxiv url: http://arxiv.org/abs/2604.08457v1
Date: Thu, 09 Apr 2026 16:52:04 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:06.03352
Title: CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
Title (reference translation): CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Understanding and Reasoning about Traffic Crash Scenes
Authors: Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran
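Abstract summary: CrashSight is a vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. The authors benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios.
Reference score (independently computed attention score): 27.23760411917563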
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
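Abstract (reference translation): Cooperative autonomous driving requires an understanding of traffic scenes from both the vehicle and the infrastructure perspective. Vision-language models (VLMs) show strong general reasoning capabilities, but because existing benchmarks focus on the ego vehicle, their performance in safety-critical traffic scenarios has not been sufficiently evaluated. To fill this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding that uses real-world roadside camera data. The dataset consists of 250 crash videos annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We analyze failure scenarios in detail and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.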
