Investigating Traffic Accident Detection Using Multimodal Large Language Models
- URL: http://arxiv.org/abs/2509.19096v2
- Date: Wed, 24 Sep 2025 08:42:59 GMT
- Title: Investigating Traffic Accident Detection Using Multimodal Large Language Models
- Authors: Ilhan Skender, Kailin Tong, Selim Solmaz, Daniel Watzenig
- Abstract summary: This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents. Results show Pixtral as the top performer with an F1-score of 71% and 83% recall. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques.
- Score: 3.4123736336071864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis of the Gemini 1.5 and 2.0, Gemma 3, and Pixtral models in accident identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi-object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 71% and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
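To make the enhanced-prompt idea concrete, here is a minimal sketch that serializes YOLO detections into prompt text for an MLLM. It assumes the ultralytics package; the checkpoint name, prompt wording, and any downstream MLLM call are illustrative stand-ins rather than the paper's exact implementation, and tracker output (e.g., Deep SORT track IDs) or SAM masks could be appended to the same prompt in the same fashion.

```python
# Hedged sketch: fold detector output into an "enhanced prompt" for an MLLM.
# Assumes the `ultralytics` package; the template below is hypothetical.
from ultralytics import YOLO

def build_enhanced_prompt(image_path: str) -> str:
    model = YOLO("yolov8n.pt")             # any pretrained detector checkpoint
    result = model(image_path)[0]          # one image -> one Results object
    lines = []
    for box in result.boxes:
        cls_name = result.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        lines.append(f"- {cls_name} (conf {float(box.conf):.2f}) "
                     f"at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
    detections = "\n".join(lines) or "- no objects detected"
    return (
        "You are a traffic-monitoring assistant. Using the image and the\n"
        "object detections below, state whether a traffic accident is\n"
        "visible and describe the vehicles involved.\n\n"
        f"Detections:\n{detections}"
    )
```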
Related papers
- Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture [2.621034368312571]
Traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Traditional computer vision methods for accident detection struggle with limited understanding and poor cross-domain generalization. We propose an accident detection model based on a transformer architecture using pre-extracted spatial video features.
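A minimal sketch of such a transformer classifier over pre-extracted per-frame features; the dimensions, pooling, and head are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: transformer encoder over per-frame CNN features,
# pooled into a single accident/no-accident logit.
import torch
import torch.nn as nn

class AccidentTransformer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)    # map CNN features to model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)           # accident / no-accident logit

    def forward(self, frame_feats):                 # (batch, frames, feat_dim)
        x = self.encoder(self.proj(frame_feats))
        return self.head(x.mean(dim=1))             # temporal mean pooling

logit = AccidentTransformer()(torch.randn(2, 16, 2048))  # toy input
```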
arXiv Detail & Related papers (2025-12-12T07:57:36Z)
- YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection [0.0]
This paper introduces YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue monitoring. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM) and the Rectangular Module (RCM). Experiments on the DSM dataset demonstrated that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%.
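The summary does not describe CAFM's internals, so the following is only a generic convolution-and-attention fusion pattern (local depthwise-conv branch plus global self-attention branch, summed), offered as an illustrative assumption.

```python
# Hedged sketch of a generic convolution-and-attention fusion block;
# NOT the actual CAFM, whose design is not given in the summary.
import torch
import torch.nn as nn

class ConvAttnFusion(nn.Module):
    def __init__(self, channels=64, n_heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: (batch, C, H, W)
        local = self.conv(x)                    # local context
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (batch, H*W, C)
        glob, _ = self.attn(seq, seq, seq)      # global context
        glob = self.norm(glob).transpose(1, 2).reshape(b, c, h, w)
        return local + glob                     # fuse both branches
```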
arXiv Detail & Related papers (2025-08-16T07:19:04Z)
- Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision [2.0720154517628417]
We propose a novel framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition.
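The generic mechanism behind such cross-modal contrastive models is a symmetric InfoNCE objective over paired image and text embeddings; this sketch uses an illustrative temperature and is not TSR-MCL's exact loss.

```python
# Hedged sketch of a symmetric cross-modal contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)           # (batch, dim)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature         # pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    # matched image/text pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2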
arXiv Detail & Related papers (2025-07-31T08:23:30Z)
- LSM-2: Learning from Incomplete Wearable Sensor Data [65.58595667477505]
This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM). AIM learns robust representations directly from incomplete data without requiring explicit imputation. Our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression, and generative modeling.
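One simple way to learn from gaps without imputation is to treat missing entries as a learned mask token; this sketch illustrates that general idea only, as AIM's actual inherited-masking scheme is not specified in the summary.

```python
# Hedged sketch: embed sensor streams where NaN gaps become a mask token.
import torch
import torch.nn as nn

class MaskedSensorEmbed(nn.Module):
    def __init__(self, n_channels=8, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):                        # x: (batch, time, channels)
        missing = torch.isnan(x).any(dim=-1)     # (batch, time) gap indicator
        emb = self.proj(torch.nan_to_num(x))     # project observed values
        emb[missing] = self.mask_token           # gaps get the mask token
        return emb
```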
arXiv Detail & Related papers (2025-06-05T17:57:11Z)
- Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance.
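The underlying signal is easy to compute: the entropy of each attention distribution, where attention that collapses onto a trigger region yields unusually low entropy. The pooling and threshold below are illustrative assumptions, not BYE's exact procedure.

```python
# Hedged sketch of an attention-entropy filter for suspicious samples.
import torch

def attention_entropy(attn):            # attn: (heads, query, key), rows sum to 1
    p = attn.clamp_min(1e-12)
    ent = -(p * p.log()).sum(dim=-1)    # entropy per head and query
    return ent.mean()                   # average into one score per sample

def keep_clean(samples_attn, threshold=1.0):
    # keep samples whose mean attention entropy exceeds the threshold
    return [i for i, a in enumerate(samples_attn)
            if attention_entropy(a) > threshold]
```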
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
- Floating Car Observers in Intelligent Transportation Systems: Detection Modeling and Temporal Insights [1.7205106391379021]
Floating Car Observers (FCOs) extend traditional Floating Car Data (FCD) by integrating onboard sensors to detect and localize other traffic participants. We explore various modeling approaches for FCO detections within microscopic traffic simulations to evaluate their potential for Intelligent Transportation System (ITS) applications.
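The simplest detection model one might plug into such a simulation is a range-and-field-of-view check; the range and FOV values here are illustrative, and the paper compares richer models than this.

```python
# Hedged sketch: is a target within an observer's sensor range and FOV?
import math

def detects(observer_xy, observer_heading_deg, target_xy,
            sensor_range=50.0, fov_deg=120.0):
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    dist = math.hypot(dx, dy)
    bearing = math.degrees(math.atan2(dy, dx)) - observer_heading_deg
    bearing = (bearing + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    return dist <= sensor_range and abs(bearing) <= fov_deg / 2
```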
arXiv Detail & Related papers (2025-04-29T19:38:13Z)
- Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving [3.617580194719686]
This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scenes. RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38x higher throughput on edge devices.
arXiv Detail & Related papers (2025-02-11T09:54:09Z)
- From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. We present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training.
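A minimal sketch of the bridging step: render detector output as symbolic facts and hand them to an LLM to reason over. The fact schema and prompt wording are illustrative assumptions, not the paper's representation.

```python
# Hedged sketch: detector output -> symbolic facts -> LLM reasoning prompt.
def facts_from_detections(detections):
    # detections: list of dicts like {"label": "car", "box": [x1, y1, x2, y2]}
    return "\n".join(f"object({d['label']}, box={d['box']})"
                     for d in detections)

def event_prompt(detections):
    return ("Given these symbolic facts about a scene:\n"
            f"{facts_from_detections(detections)}\n"
            "Infer which higher-level event, if any, is occurring, and why.")

print(event_prompt([{"label": "car", "box": [10, 20, 90, 80]}]))
```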
arXiv Detail & Related papers (2025-02-09T10:30:54Z)
- Uncertainty Estimation for 3D Object Detection via Evidential Learning [63.61283174146648]
We introduce a framework for quantifying uncertainty in 3D object detection by leveraging an evidential learning loss on Bird's Eye View representations in the 3D detector.
We demonstrate both the efficacy and importance of these uncertainty estimates on identifying out-of-distribution scenes, poorly localized objects, and missing (false negative) detections.
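For reference, the standard Dirichlet-based evidential formulation turns a K-way head's non-negative "evidence" into a closed-form uncertainty score; this is the generic recipe, and the paper's exact loss on BEV features may differ.

```python
# Hedged sketch of Dirichlet evidential outputs for a K-way head.
import torch
import torch.nn.functional as F

def evidential_outputs(logits):                  # logits: (batch, K)
    evidence = F.softplus(logits)                # non-negative evidence
    alpha = evidence + 1.0                       # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                      # expected class probabilities
    uncertainty = logits.size(-1) / strength     # K / total evidence, in (0, 1]
    return prob, uncertainty.squeeze(-1)

prob, u = evidential_outputs(torch.randn(4, 3))  # toy 3-class example
```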
arXiv Detail & Related papers (2024-10-31T13:13:32Z)
- Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events [5.233512464561313]
Multimodal Large Language Models (MLLMs) offer a novel approach by integrating textual, visual, and audio modalities.
Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts.
Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis.
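As a concrete illustration of "context-specific prompts", here is a hedged template sketch; the fields and wording are hypothetical, not the paper's actual prompts.

```python
# Hedged sketch of a context-specific prompt for safety-critical events.
def safety_prompt(location: str, weather: str) -> str:
    return (
        f"Context: infrastructure camera at {location}; weather: {weather}.\n"
        "Task: from the attached frames, decide whether a safety-critical\n"
        "event (near-miss, collision, harsh braking) occurred. Answer with\n"
        "a verdict, the actors involved, and a one-sentence justification."
    )

print(safety_prompt("4-way urban intersection", "light rain"))
```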
arXiv Detail & Related papers (2024-06-19T23:50:41Z)
- Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We further formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
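A hedged sketch of the fine-tuning recipe, cast as sequence classification of crash-report text by severity; the model name, label count, and dataset wiring are illustrative assumptions, not the CrashEvent setup.

```python
# Hedged sketch: fine-tune a pretrained LM to classify crash-report severity.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)     # e.g., four severity levels

def encode(batch):                               # batch: {"text": [...], "label": [...]}
    return tok(batch["text"], truncation=True,
               padding="max_length", max_length=256)

args = TrainingArguments(output_dir="crash-llm", num_train_epochs=3,
                         per_device_train_batch_size=16)
# train_ds / eval_ds: datasets with "text"/"label" columns, mapped via encode
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```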
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
- AccidentBlip: Agent of Accident Warning based on MA-former [24.81148840857782]
AccidentBlip is a vision-only framework that employs our self-designed Motion Accident Transformer (MA-former) to process each frame of video. AccidentBlip achieves strong performance in both accident detection and prediction tasks on the DeepAccident dataset. It also outperforms current SOTA methods in V2V and V2X scenarios, demonstrating a superior capability to understand complex real-world environments.
arXiv Detail & Related papers (2024-04-18T12:54:25Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
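One plausible reading of "spatially quantized historical features" is an occupancy grid built from past-traversal LiDAR returns; the grid size and count statistic in this sketch are illustrative assumptions only.

```python
# Hedged sketch: quantize past LiDAR returns into a 2D hit-count grid.
import numpy as np

def historical_occupancy(points_xy, cell=0.5, extent=50.0):
    # points_xy: (N, 2) array of past LiDAR returns in ego frame (meters)
    n = int(2 * extent / cell)
    idx = np.floor((points_xy + extent) / cell).astype(int)
    keep = ((idx >= 0) & (idx < n)).all(axis=1)
    grid = np.zeros((n, n), dtype=np.float32)
    np.add.at(grid, (idx[keep, 0], idx[keep, 1]), 1.0)
    return np.log1p(grid)                        # compressed hit counts per cell
```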
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
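The multitask pattern behind MMISM is a shared encoder feeding several task heads; this sketch is RGB-only for brevity, and the backbone and head shapes are illustrative assumptions.

```python
# Hedged sketch: one shared encoder, multiple task-specific heads.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, d=256, n_classes=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d, 3, stride=2, padding=1),
            nn.ReLU(), nn.AdaptiveAvgPool2d(8))
        self.det_head = nn.Linear(d * 64, 6)       # e.g., one 3D box summary
        self.seg_head = nn.Conv2d(d, n_classes, 1) # per-cell semantic logits

    def forward(self, rgb):                        # rgb: (batch, 3, H, W)
        feat = self.encoder(rgb)                   # (batch, d, 8, 8)
        return {"detection": self.det_head(feat.flatten(1)),
                "segmentation": self.seg_head(feat)}
```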
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- Training-free Monocular 3D Event Detection System for Traffic Surveillance [93.65240041833319]
Existing event detection systems are mostly learning-based and have achieved convincing performance when a large amount of training data is available.
In real-world scenarios, collecting sufficient labeled training data is expensive and sometimes impossible.
We propose a training-free monocular 3D event detection system for traffic surveillance.
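Training-free monocular 3D localization typically rests on a calibrated camera plus a flat-road assumption: a pixel's ray is intersected with the ground plane. This geometric sketch shows that step; the calibration inputs are assumed, and it is not necessarily the paper's exact pipeline.

```python
# Hedged sketch: back-project a pixel onto the z = 0 road plane.
import numpy as np

def pixel_to_ground(u, v, K, R, t):
    # K: 3x3 intrinsics; R, t: world-to-camera rotation and translation
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_w = R.T @ ray_cam                         # ray direction in world frame
    cam_w = -R.T @ t                              # camera center in world frame
    s = -cam_w[2] / ray_w[2]                      # scale where the ray hits z = 0
    return cam_w + s * ray_w                      # 3D point on the road plane
```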
arXiv Detail & Related papers (2020-02-01T04:42:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.