Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection
- URL: http://arxiv.org/abs/2502.20573v1
- Date: Thu, 27 Feb 2025 22:26:29 GMT
- Title: Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection
- Authors: Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
- Abstract summary: This study explores the capability of leveraging Multimodal Large Language Models (MLLMs) to provide logical and visual reasoning. In the proposed method, GPT-4o acts as an intelligent system to detect conflicts and provide explanations and recommendations for the drivers.
- Score: 5.233512464561313
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traffic control in unsignalized urban intersections presents significant challenges due to their complexity, frequent conflicts, and blind spots. This study explores the capability of leveraging Multimodal Large Language Models (MLLMs), such as GPT-4o, to provide logical and visual reasoning by directly using bird's-eye-view videos of four-legged intersections. In the proposed method, GPT-4o acts as an intelligent system to detect conflicts and provide explanations and recommendations for the drivers. The fine-tuned model achieved an accuracy of 77.14%, while manual evaluation of the true predicted values of the fine-tuned GPT-4o showed significant achievements of 89.9% accuracy for model-generated explanations and 92.3% for the recommended next actions. These results highlight the feasibility of using MLLMs for real-time traffic management using videos as inputs, offering scalable and actionable insights into intersection traffic management and operation. Code used in this study is available at https://github.com/sarimasri3/Traffic-Intersection-Conflict-Detection-using-images.git.
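As a rough, hedged sketch of the pipeline the abstract describes: sample frames from a bird's-eye-view intersection video and query a fine-tuned GPT-4o model through the OpenAI chat-completions API for a conflict verdict, an explanation, and a recommended next action. The model identifier, prompt wording, and frame-sampling parameters below are illustrative assumptions, not the authors' configuration; see the linked repository for the actual code.

```python
# Hedged sketch: frame sampling + fine-tuned GPT-4o conflict query.
# The fine-tune id, prompt, and sampling rate are assumptions, not the paper's setup.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def sample_frames(video_path: str, every_n: int = 30, max_frames: int = 8) -> list[str]:
    """Grab every Nth frame from the video and return base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        idx += 1
    cap.release()
    return frames


def detect_conflicts(video_path: str) -> str:
    """Send sampled frames to the (hypothetical) fine-tuned model and return its answer."""
    content = [{
        "type": "text",
        "text": ("These are sequential bird's-eye-view frames of an unsignalized "
                 "four-legged intersection. Is a traffic conflict developing? "
                 "Explain your reasoning and recommend a next action for each driver."),
    }]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
                for b64 in sample_frames(video_path)]
    response = client.chat.completions.create(
        model="ft:gpt-4o-2024-08-06:example-org:intersections:abc123",  # hypothetical id
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(detect_conflicts("intersection_clip.mp4"))
```

The single free-text response keeps the sketch short; the paper's reported accuracies suggest the actual setup scores the verdict, explanation, and recommendation separately.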
Related papers
- Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring [6.648291808015463]
This research leverages the LLaVA visual grounding multimodal large language model (LLM) for traffic monitoring tasks on the real-time Quanser Interactive Lab simulation platform. Cameras placed at multiple urban locations collect real-time images from the simulation, which are fed into the LLaVA model with queries for analysis. The system achieves 84.3% accuracy in recognizing vehicle locations and 76.4% in determining steering direction, outperforming traditional models.
arXiv Detail & Related papers (2025-02-16T23:03:26Z) - DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments [60.69159598130235]
We present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs). DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turns, etc.). Our experiments show that existing methods suffer performance degradation when evaluated on DAVE, highlighting its benefit for future video recognition research.
arXiv Detail & Related papers (2024-12-28T06:13:44Z) - Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning [4.336971448707467]
Multi-agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations.
This study proposes a co-simulation framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling with large-scale traffic flow simulation.
Experiments in the test-bed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real-time camera-based detection.
arXiv Detail & Related papers (2024-12-05T07:01:56Z) - Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm [5.233512464561313]
This study introduces a novel approach for traffic control systems by using Large Language Models (LLMs) as traffic controllers.
The study utilizes their logical reasoning, scene understanding, and decision-making capabilities to optimize throughput and provide feedback based on traffic conditions in real time.
arXiv Detail & Related papers (2024-11-16T19:23:52Z) - Strada-LLM: Graph LLM for traffic prediction [62.2015839597764]
A considerable challenge in traffic prediction lies in handling the diverse data distributions caused by vastly different traffic conditions. We propose a graph-aware LLM for traffic prediction that considers proximal traffic information. We adopt a lightweight approach for efficient domain adaptation when facing new data distributions in a few-shot fashion.
arXiv Detail & Related papers (2024-10-28T09:19:29Z) - Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios [5.233512464561313]
This study explores the ability of a Large Language Model (LLM) to improve traffic management at urban intersections.
We employed GPT-4o-mini to analyze traffic, predict vehicle positions, and detect and resolve conflicts at an intersection in real time.
Results show that GPT-4o-mini effectively detected and resolved conflicts under heavy traffic, congestion, and mixed-speed conditions.
arXiv Detail & Related papers (2024-08-01T23:06:06Z) - GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events [25.51232964290688]
The recognition and understanding of traffic incidents, particularly traffic accidents, is a topic of paramount importance in the realm of intelligent transportation systems and vehicles.
The advent of large vision-language models (VLMs) such as GPT-4V has introduced innovative approaches to addressing this issue.
We observe that GPT-4V demonstrates remarkable cognitive, reasoning, and decision-making ability in certain classic traffic events.
arXiv Detail & Related papers (2024-02-03T16:38:25Z) - DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z) - iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning [57.24340061741223]
We introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios.
Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations.
arXiv Detail & Related papers (2023-06-09T20:12:02Z) - End-to-End Intersection Handling using Multi-Agent Deep Reinforcement Learning [63.56464608571663]
Navigating through intersections is one of the most challenging tasks for an autonomous vehicle.
In this work, we focus on the implementation of a system able to navigate through intersections where only traffic signs are provided.
We propose a multi-agent system that uses a continuous, model-free Deep Reinforcement Learning algorithm to train a neural network to predict both the acceleration and the steering angle at each time step.
arXiv Detail & Related papers (2021-04-28T07:54:40Z) - Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion (a minimal sketch of attention-based fusion appears after this list).
arXiv Detail & Related papers (2021-04-19T11:48:13Z)
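The TransFuser entry above describes integrating image and LiDAR representations with attention. The sketch below is a minimal, hedged illustration of that general idea using PyTorch's standard multi-head attention; the embedding dimension, token counts, and overall structure are assumptions for illustration and do not reproduce the paper's actual architecture.

```python
# Hedged sketch of attention-based sensor fusion in the spirit of TransFuser:
# image and LiDAR feature tokens are concatenated and mixed by self-attention.
# Dimensions and structure are illustrative assumptions only.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, lidar_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate both modalities into one token sequence so attention
        # can exchange information across sensors.
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)  # residual connection


# Usage with arbitrary shapes: batch of 2, 64 image tokens, 32 LiDAR tokens.
fusion = AttentionFusion()
img = torch.randn(2, 64, 256)
lidar = torch.randn(2, 32, 256)
print(fusion(img, lidar).shape)  # torch.Size([2, 96, 256])
```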