When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
- URL: http://arxiv.org/abs/2501.10604v1
- Date: Fri, 17 Jan 2025 23:35:34 GMT
- Title: When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
- Authors: Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay
- Abstract summary: SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. The framework employs a severity-based aggregation strategy to handle videos of various lengths and a multimodal prompt to generate structured responses for review and evaluation. Extensive experiments on the Toyota Woven Traffic Safety dataset demonstrate that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
- Score: 6.213279061986497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing availability of traffic videos recorded around the clock (24/7/365) has great potential to broaden the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of continuously operating traffic cameras remains an extremely challenging task: current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detections, and require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow into a more interactive, conversational approach. This shift significantly increases processing throughput by automating complex tasks such as video classification and visual grounding, and improves adaptability by enabling seamless adjustment to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses that support review, evaluation, and fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at \url{https://github.com/ai4ce/SeeUnsafe}.
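The overall flow described in the abstract (per-clip MLLM classification, severity-based aggregation over clips of different lengths, and a matching score between the structured response and ground truth) can be sketched as below. This is a minimal illustrative sketch, not the released SeeUnsafe implementation: the function names, the severity label set, and the field-matching stand-in for IMS are assumptions.

```python
# Minimal sketch (not the authors' code) of a severity-based aggregation loop
# and a toy information-matching score. The clip splitting, severity labels,
# and field-level comparison are illustrative assumptions; the real prompts,
# agents, and IMS metric are defined in the paper.
from typing import Callable, Dict, List

SEVERITY_ORDER = ["no_accident", "minor", "severe"]  # assumed label set

def aggregate_severity(clip_labels: List[str]) -> str:
    """Aggregate per-clip MLLM labels by keeping the most severe one."""
    return max(clip_labels, key=SEVERITY_ORDER.index)

def classify_video(clips: List[str], classify_clip: Callable[[str], str]) -> str:
    """Run an MLLM-backed classifier on each clip, then aggregate."""
    return aggregate_severity([classify_clip(c) for c in clips])

def toy_information_matching_score(pred: Dict[str, str], gt: Dict[str, str]) -> float:
    """Fraction of ground-truth fields matched in the structured response
    (a stand-in for the MLLM-judged IMS described in the abstract)."""
    matched = sum(1 for k, v in gt.items() if pred.get(k) == v)
    return matched / max(len(gt), 1)

if __name__ == "__main__":
    fake_mllm = lambda clip: "severe" if "crash" in clip else "no_accident"
    print(classify_video(["clip_001.mp4", "crash_002.mp4"], fake_mllm))
    print(toy_information_matching_score(
        {"collision_type": "rear-end", "severity": "severe"},
        {"collision_type": "rear-end", "severity": "minor"}))
```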
Related papers
- Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap [51.198001060683296]
Large Language Models (LLMs) offer transformative potential to address transportation challenges.
This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation.
For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization.
arXiv Detail & Related papers (2025-03-27T11:56:27Z) - Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.
The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in realtime.
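A minimal sketch of this three-layer flow (end, edge, cloud) is shown below; all class and function names are hypothetical placeholders, and the real SV-FDT components (segmentation models, twin agents, digital-twin integration) are described in the paper.

```python
# Illustrative end -> edge -> cloud flow for the layered architecture above.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LocalDigitalTwin:
    region: str
    observations: List[str] = field(default_factory=list)

def end_layer_collect(camera_ids: List[str]) -> Dict[str, List[str]]:
    """End layer: gather surveillance video references per region."""
    return {cam: [f"{cam}_frame_{i}.jpg" for i in range(3)] for cam in camera_ids}

def edge_layer_build_ldts(videos: Dict[str, List[str]]) -> List[LocalDigitalTwin]:
    """Edge layer: placeholder for segmentation-based understanding and
    twin-agent interaction modeling; yields one local digital twin per region."""
    return [LocalDigitalTwin(region=cam, observations=frames)
            for cam, frames in videos.items()]

def cloud_layer_integrate(ldts: List[LocalDigitalTwin]) -> Dict[str, int]:
    """Cloud layer: merge local twins into a (toy) global view."""
    return {t.region: len(t.observations) for t in ldts}

if __name__ == "__main__":
    print(cloud_layer_integrate(edge_layer_build_ldts(end_layer_collect(["cam_A", "cam_B"]))))
```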
arXiv Detail & Related papers (2025-03-06T07:36:06Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
Multimodal Large Language Models (MLLMs) can process both visual and textual data.
We propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
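The two ideas summarized above (fusing appearance and motion features per encoder level, then refining predictions with a cascade of decoders) can be sketched in PyTorch as follows; the channel sizes and module names are illustrative assumptions, not MTNet's actual architecture.

```python
# Toy sketch: per-level appearance/motion fusion plus a cascaded decoder.
import torch
import torch.nn as nn

class LevelFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance, motion):
        # Concatenate the two streams and mix them back to `channels`.
        return self.mix(torch.cat([appearance, motion], dim=1))

class CascadedDecoder(nn.Module):
    """Refine a segmentation mask level by level using the fused features."""
    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        self.refiners = nn.ModuleList(
            [nn.Conv2d(channels + 1, 1, kernel_size=3, padding=1) for _ in range(num_levels)])

    def forward(self, fused_levels):
        mask = torch.zeros_like(fused_levels[0][:, :1])
        for feat, refine in zip(fused_levels, self.refiners):
            mask = torch.sigmoid(refine(torch.cat([feat, mask], dim=1)))
        return mask

if __name__ == "__main__":
    fuse = LevelFusion(16)
    fused = [fuse(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)) for _ in range(3)]
    print(CascadedDecoder(16, 3)(fused).shape)  # torch.Size([1, 1, 32, 32])
```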
arXiv Detail & Related papers (2025-01-14T03:15:46Z) - Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition [49.20086587208214]
We propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition.
By using description texts, our method reduces the cross-domain differences between template and real traffic signs.
Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels.
arXiv Detail & Related papers (2024-07-08T10:51:03Z) - Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events [5.233512464561313]
Multimodal Large Language Models (MLLMs) offer a novel approach by integrating textual, visual, and audio modalities.
Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts.
Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis.
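The "context-specific prompts" mentioned above could take a form like the following; the schema fields and wording are invented for illustration and are not the paper's actual prompts.

```python
# Hypothetical context-specific prompt template for safety-critical event detection.
SAFETY_PROMPT = """You are reviewing dashcam footage from a {vehicle_type}.
Context: {context}
List any safety-critical events (hard braking, near-miss, collision) and
answer in JSON with fields: event_type, involved_road_users, risk_level."""

def build_prompt(vehicle_type: str, context: str) -> str:
    return SAFETY_PROMPT.format(vehicle_type=vehicle_type, context=context)

if __name__ == "__main__":
    print(build_prompt("transit bus", "urban intersection, rainy, 17:30"))
```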
arXiv Detail & Related papers (2024-06-19T23:50:41Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - TrafficMOT: A Challenging Dataset for Multi-Object Tracking in Complex Traffic Scenarios [23.831048188389026]
Multi-object tracking in traffic videos offers immense potential for enhancing traffic monitoring accuracy and promoting road safety measures.
Existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes.
We introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios.
arXiv Detail & Related papers (2023-11-30T18:59:56Z) - HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving [44.06475712570428]
HiLM-D is a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP.
Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories.
Experiments show HiLM-D's significant improvements over current MLLMs, with a 3.7% gain in BLEU-4 for captioning and an 8.7% gain in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z) - A Memory-Augmented Multi-Task Collaborative Framework for Unsupervised Traffic Accident Detection in Driving Videos [22.553356096143734]
We propose a novel memory-augmented multi-task collaborative framework (MAMTCF) for unsupervised traffic accident detection in driving videos.
Our method can more accurately detect both ego-involved and non-ego accidents by simultaneously modeling appearance changes and object motions in video frames.
arXiv Detail & Related papers (2023-07-27T01:45:13Z) - Traffic-Domain Video Question Answering with Automatic Captioning [69.98381847388553]
Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities.
We present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models.
arXiv Detail & Related papers (2023-07-18T20:56:41Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z) - Deep Learning Serves Traffic Safety Analysis: A Forward-looking Review [4.228522109021283]
We present a typical processing pipeline, which can be used to understand and interpret traffic videos.
This processing framework comprises several steps: video enhancement, video stabilization, semantic and incident segmentation, object detection and classification, trajectory extraction, speed estimation, event analysis, modeling, and anomaly detection.
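A modular pipeline in the spirit of the steps listed above could be wired up as in the sketch below; each stage is a placeholder callable, and real systems would plug in stabilization, segmentation, detection, tracking, and anomaly-detection models.

```python
# Toy pipeline skeleton: each named stage is a stand-in for a real model.
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def make_stage(name: str) -> Stage:
    def stage(state: Dict[str, Any]) -> Dict[str, Any]:
        state.setdefault("log", []).append(name)  # placeholder for real processing
        return state
    return stage

PIPELINE: List[Stage] = [make_stage(s) for s in (
    "video_enhancement", "video_stabilization", "semantic_segmentation",
    "object_detection", "trajectory_extraction", "speed_estimation",
    "event_analysis", "anomaly_detection")]

def run_pipeline(video_path: str) -> Dict[str, Any]:
    state: Dict[str, Any] = {"video": video_path}
    for stage in PIPELINE:
        state = stage(state)
    return state

if __name__ == "__main__":
    print(run_pipeline("intersection_cam_01.mp4")["log"])
```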
arXiv Detail & Related papers (2022-03-07T17:21:07Z) - Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)