AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond
- URL: http://arxiv.org/abs/2509.26636v1
- Date: Tue, 30 Sep 2025 17:59:13 GMT
- Title: AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond
- Authors: Shangding Gu, Xiaohan Wang, Donghao Ying, Haoyu Zhao, Runing Yang, Ming Jin, Boyi Li, Marco Pavone, Serena Yeung-Levy, Jun Wang, Dawn Song, Costas Spanos
- Abstract summary: AccidentBench is a large-scale benchmark that combines vehicle accident scenarios with Beyond domains. The benchmark contains approximately 2000 videos and over 19000 human-annotated question-answer pairs.
- Score: 101.20320617562321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with Beyond domains: safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2000 videos and over 19000 human-annotated question-answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges. The code and dataset are available at: https://github.com/SafeRL-Lab/AccidentBench
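To make the stratified evaluation concrete, here is a minimal sketch of an accuracy loop over a benchmark with this shape. The JSON field names and the `query_model` callable are assumptions for illustration, not the actual AccidentBench release format; see the GitHub repository for the real loaders.

```python
import json
from collections import defaultdict

def evaluate(qa_path: str, query_model) -> dict:
    """Score a multimodal model on video QA pairs, stratified by
    difficulty (easy/medium/hard) and video length (short/medium/long).

    Assumes each record looks like (hypothetical schema):
      {"video": "clip.mp4", "question": "...", "answer": "B",
       "difficulty": "hard", "length": "long"}
    """
    with open(qa_path) as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        pred = query_model(r["video"], r["question"])  # e.g., an API call
        key = (r["difficulty"], r["length"])
        total[key] += 1
        correct[key] += int(pred.strip().upper() == r["answer"])

    # Per-stratum accuracy, keyed by (difficulty, length).
    return {k: correct[k] / total[k] for k in total}
```

Stratifying by the (difficulty, length) pair is what surfaces the roughly 18% accuracy on the hardest tasks and longest videos reported above, which an aggregate accuracy number would hide.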
Related papers
- InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation [53.47253633654885]
InstaDrive is a novel framework that enhances driving video realism through two key advancements. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality. Our project page is https://shanpoyang654.io/InstaDrive/page.html.
arXiv Detail & Related papers (2026-02-03T08:22:13Z) - Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs [10.48956192789531]
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems. Current benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with temporal dynamics. We introduce HazardForge, a scalable pipeline that leverages image editing models to generate such scenarios, with layout decision algorithms and validation modules.
arXiv Detail & Related papers (2026-01-13T11:55:31Z) - SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses [0.7874708385247353]
This paper introduces SAVeD, a large-scale video dataset curated from publicly available social media content. SAVeD features 2,119 first-person videos capturing ADAS vehicle operations in diverse locations, lighting conditions, and weather scenarios. The dataset includes video frame-level annotations for collisions, evasive maneuvers, and disengagements, enabling analysis of both perception and decision-making failures.
arXiv Detail & Related papers (2025-12-19T15:58:52Z) - IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation [56.43007596544299]
IndustryNav is the first dynamic industrial navigation benchmark for active spatial reasoning. A study of nine state-of-the-art Visual Large Language Models reveals that closed-source models maintain a consistent advantage.
arXiv Detail & Related papers (2025-11-21T16:48:49Z) - iSafetyBench: A video-language benchmark for safety in industrial environment [6.697702130929693]
iSafetyBench is a new video-language benchmark designed to evaluate model performance in industrial environments. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings. We evaluate eight state-of-the-art video-language models under zero-shot conditions.
arXiv Detail & Related papers (2025-08-01T07:55:53Z) - MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z) - Learning collision risk proactively from naturalistic driving data at scale [3.1457219084519004]
This study introduces the Generalised Surrogate Safety Measure (GSSM). GSSM learns collision risk from naturalistic driving without the need for crash or risk labels. Diverse naturalistic driving data, including motion kinematics, weather, and lighting, are used to train multiple GSSMs. A basic GSSM using only instantaneous motion kinematics achieves an area under the precision-recall curve (AUPRC) of 0.9 and secures a median time advance of 2.6 seconds to prevent potential collisions; a sketch of the AUPRC computation appears after this list.
arXiv Detail & Related papers (2025-05-19T07:22:32Z) - CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions [13.981748780317329]
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs).
This study introduces a novel accident anticipation framework for AVs, termed CRASH.
It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion.
Our model surpasses existing top baselines on critical evaluation metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA); a sketch of the mTTA convention appears after this list.
arXiv Detail & Related papers (2024-07-25T04:12:49Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and the experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving [76.29141888408265]
We propose a large-scale dataset containing diverse accident scenarios that frequently occur in real-world driving.
The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset.
arXiv Detail & Related papers (2023-04-03T17:37:00Z)
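As a companion to the GSSM entry above, here is a minimal sketch of how an area under the precision-recall curve (AUPRC) is computed from per-event risk scores. The scores and labels below are synthetic placeholders, not GSSM outputs; scikit-learn's `precision_recall_curve` and `auc` are the only real APIs used.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Synthetic placeholders: 1 = the conflict led to a crash/near-crash, 0 = safe.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
# Model-assigned collision risk for the same events (higher = riskier).
risk_scores = np.array([0.1, 0.3, 0.8, 0.2, 0.7, 0.9, 0.4, 0.6])

precision, recall, _ = precision_recall_curve(y_true, risk_scores)
auprc = auc(recall, precision)  # area under the precision-recall curve
print(f"AUPRC = {auprc:.2f}")
```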
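For the CRASH entry, mean Time-To-Accident (mTTA) is commonly computed as the average lead time between the first frame at which the model's accident probability crosses a threshold and the annotated accident frame. The sketch below follows that common convention with illustrative names and a placeholder threshold; it is not the paper's code, and some papers average only over videos where an alarm is actually raised.

```python
import numpy as np

def time_to_accident(probs, accident_frame, fps=20.0, threshold=0.5):
    """Lead time in seconds between the first alarm and the accident.

    probs: per-frame accident probabilities for one positive video.
    Returns 0.0 if no frame before the accident crosses the threshold.
    """
    alarms = np.flatnonzero(np.asarray(probs[:accident_frame]) >= threshold)
    if alarms.size == 0:
        return 0.0
    return (accident_frame - alarms[0]) / fps

# Illustrative mTTA over two positive (accident) videos at 20 fps.
videos = [
    (np.array([0.1, 0.2, 0.6, 0.8, 0.9]), 4),  # first alarm at frame 2 -> 0.1 s
    (np.array([0.1, 0.1, 0.2, 0.3, 0.4]), 4),  # never alarms -> 0.0 s
]
mtta = np.mean([time_to_accident(p, f) for p, f in videos])
print(f"mTTA = {mtta:.2f} s")
```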
This list is automatically generated from the titles and abstracts of the papers on this site.