Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset
- URL: http://arxiv.org/abs/2503.11342v1
- Date: Fri, 14 Mar 2025 12:18:11 GMT
- Title: Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset
- Authors: Yibing Weng, Yu Gu, Fuji Ren
- Abstract summary: Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates.
- Score: 4.357836359387452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression, lacking proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks like antecedent-focused road rage regulation.
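The abstract names three evaluation levels (scene understanding, event recognition, road rage reasoning) but does not include the protocol itself. Below is a minimal sketch of how such a per-level assessment might be looped over an annotated test set; the JSON schema, the question/answer fields, the substring scoring, and the `query_vlm` callable are all assumptions for illustration, not the authors' released code or metrics.

```python
# Hypothetical evaluation loop for the three reasoning levels described above.
# Everything about the data layout and scoring here is an assumption.
import json
from typing import Callable, Dict

# User-supplied function: takes (image_path, question) and returns the VLM's answer text.
QueryVLM = Callable[[str, str], str]

LEVELS = ("scene_understanding", "event_recognition", "road_rage_reasoning")


def evaluate(annotations_path: str, query_vlm: QueryVLM) -> Dict[str, float]:
    """Score a VLM per level on an annotated dashcam test set (assumed schema)."""
    with open(annotations_path, encoding="utf-8") as f:
        # Assumed: a list of {"image": str, "<level>": {"question": str, "answer": str}}.
        samples = json.load(f)

    correct = {level: 0 for level in LEVELS}
    total = {level: 0 for level in LEVELS}
    for sample in samples:
        for level in LEVELS:
            item = sample.get(level)
            if item is None:
                continue
            prediction = query_vlm(sample["image"], item["question"])
            total[level] += 1
            # Naive substring match as a stand-in for the paper's (unspecified here) metrics.
            if item["answer"].strip().lower() in prediction.strip().lower():
                correct[level] += 1

    return {level: correct[level] / max(total[level], 1) for level in LEVELS}
```

In practice, grading open-ended reasoning answers would likely require human or LLM-based judging rather than substring comparison; the sketch only fixes the overall structure of a per-level evaluation.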
Related papers
- Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities. We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z)
- Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems. We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics. We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z)
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z)
- DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments [60.69159598130235]
We present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs).
DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (complex and rare cases like cut-ins, zigzag movement, U-turn, etc.).
Our experiments show that existing methods suffer degradation in performance when evaluated on DAVE, highlighting its benefit for future video recognition research.
arXiv Detail & Related papers (2024-12-28T06:13:44Z)
- A Memory-Augmented Multi-Task Collaborative Framework for Unsupervised Traffic Accident Detection in Driving Videos [22.553356096143734]
We propose a novel memory-augmented multi-task collaborative framework (MAMTCF) for unsupervised traffic accident detection in driving videos.
Our method can more accurately detect both ego-involved and non-ego accidents by simultaneously modeling appearance changes and object motions in video frames.
arXiv Detail & Related papers (2023-07-27T01:45:13Z)
- OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping [84.65114565766596]
We present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure.
OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes.
We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
arXiv Detail & Related papers (2023-04-20T16:31:22Z)
- Salient Sign Detection In Safe Autonomous Driving: AI Which Reasons Over Full Visual Context [2.799896314754614]
Various traffic signs in a driving scene have an unequal impact on the driver's decisions.
We construct a traffic sign detection model which emphasizes performance on salient signs.
We show that a model trained with Salience-Sensitive Focal Loss outperforms a model trained without it.
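The summary names Salience-Sensitive Focal Loss but does not define it. A minimal sketch is given below, assuming the per-sign salience enters as a simple multiplicative weight on a standard binary focal loss; the (1 + salience) weighting, the tensor shapes, and the default gamma/alpha values are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a salience-weighted focal loss; the weighting scheme is an assumption.
import torch
import torch.nn.functional as F


def salience_sensitive_focal_loss(
    logits: torch.Tensor,    # (N,) raw classification scores for detected signs
    targets: torch.Tensor,   # (N,) binary labels {0, 1}
    salience: torch.Tensor,  # (N,) salience weights in [0, 1]; higher = more safety-critical
    gamma: float = 2.0,
    alpha: float = 0.25,
) -> torch.Tensor:
    targets = targets.float()
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = alpha_t * (1.0 - p_t) ** gamma * ce              # standard focal loss per sample
    # Assumed salience weighting: errors on salient signs cost up to twice as much.
    return ((1.0 + salience) * focal).mean()
```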
arXiv Detail & Related papers (2023-01-14T01:47:09Z)
- Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark [77.54411007883962]
We propose a Cognitive Accident Prediction (CAP) method that explicitly leverages human-inspired cognition, in the form of text descriptions of the visual observation and driver attention, to facilitate model training.
CAP is formulated by an attentive text-to-vision shift fusion module, an attentive scene context transfer module, and the driver attention guided accident prediction module.
We construct a new large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames.
arXiv Detail & Related papers (2022-12-19T11:43:02Z)
- Learning energy-efficient driving behaviors by imitating experts [75.12960180185105]
This paper examines the role of imitation learning in bridging the gap between control strategies and realistic limitations in communication and sensing.
We show that imitation learning can succeed in deriving policies that, if adopted by 5% of vehicles, may boost the energy-efficiency of networks with varying traffic conditions by 15% using only local observations.
arXiv Detail & Related papers (2022-06-28T17:08:31Z)
- DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation [36.350348194248014]
Traffic accident anticipation aims to accurately and promptly predict the occurrence of a future accident from dashcam videos.
Existing approaches typically focus on capturing the cues of spatial and temporal context before a future accident occurs.
We propose Deep ReInforced accident anticipation with Visual Explanation, named DRIVE.
arXiv Detail & Related papers (2021-07-21T16:33:21Z)