DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
- URL: http://arxiv.org/abs/2507.02948v3
- Date: Sun, 13 Jul 2025 15:16:06 GMT
- Title: DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
- Authors: Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Kaicheng Yu
- Abstract summary: We introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for Vision-Language Models. We design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent.
- Score: 14.010956081539476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLMs) by synthesizing high-risk motion data. Specifically, we introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting the accuracy from the base model's 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.
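The abstract's core idea of reasoning about "spatial relationships between motion waypoints and the environment" in BEV space can be illustrated with a minimal sketch. This is not the authors' code: the `Box`/`is_high_risk` names, the axis-aligned footprints, and the waypoint values are all hypothetical, and a real pipeline would use oriented boxes and the full DriveMRP-10K labeling scheme.

```python
# Hypothetical sketch: flag a candidate ego trajectory as high-risk if
# any of its waypoints falls inside another vehicle's BEV footprint.
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned BEV footprint of another vehicle (metres)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def is_high_risk(waypoints: list[tuple[float, float]],
                 obstacles: list[Box]) -> bool:
    """True if any ego waypoint intersects an obstacle footprint."""
    return any(box.contains(x, y) for (x, y) in waypoints for box in obstacles)


# A swerving trajectory whose third waypoint crosses a parked vehicle:
trajectory = [(0.0, 0.0), (2.0, 0.5), (4.0, 1.5), (6.0, 3.0)]
parked = [Box(3.5, 1.0, 5.0, 2.5)]
print(is_high_risk(trajectory, parked))  # True: (4.0, 1.5) is inside the box
```

In the paper's setting, such geometric checks would only label the synthetic data; the VLM itself is then fine-tuned to infer the risk from the rendered BEV and ego-view inputs.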
Related papers
- Optimization-Guided Diffusion for Interactive Scene Generation [52.23368750264419]
We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling. We show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds.
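The near-collision criterion quoted above rests on the standard time-to-collision (TTC) metric, which under a constant-velocity assumption is just the gap divided by the closing speed. A minimal sketch (the function name and values are illustrative, not from the paper's code):

```python
# Constant-velocity time-to-collision: the basis of OMEGA's
# "TTC under three seconds" near-collision criterion.
def time_to_collision(gap_m: float, ego_speed: float, lead_speed: float) -> float:
    """TTC = gap / closing speed; infinite if the gap is not closing."""
    closing = ego_speed - lead_speed  # m/s; positive means ego is gaining
    if closing <= 0.0:
        return float("inf")
    return gap_m / closing


# Ego at 20 m/s closing on a lead vehicle at 12 m/s, 20 m ahead:
print(time_to_collision(20.0, 20.0, 12.0))  # 2.5 s -> a near-collision frame
```

Scene generators typically sweep this check over every agent pair per frame and count frames where any pair dips below the threshold.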
arXiv Detail & Related papers (2025-12-08T15:56:18Z) - Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM [14.016225216093643]
We introduce Risk Semantic Distillation (RSD), a novel framework that leverages Vision-Language Models (VLMs) to enhance the training of End-to-End (E2E) autonomous driving backbones. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions.
arXiv Detail & Related papers (2025-11-18T13:46:18Z) - ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z) - GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control [50.67481583744243]
We introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models. We propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Our method significantly outperforms existing models in both action accuracy and 3D spatial awareness.
arXiv Detail & Related papers (2025-05-28T14:46:51Z) - RealEngine: Simulating Autonomous Driving in Realistic Context [60.55873455475112]
RealEngine is a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios. RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction.
arXiv Detail & Related papers (2025-05-22T17:01:00Z) - RADE: Learning Risk-Adjustable Driving Environment via Multi-Agent Conditional Diffusion [17.46462636610847]
Risk-Adjustable Driving Environment (RADE) is a simulation framework that generates statistically realistic and risk-adjustable traffic scenes. RADE learns risk-conditioned behaviors directly from data, preserving naturalistic multi-agent interactions with controllable risk levels. We validate RADE on the real-world rounD dataset, demonstrating that it preserves statistical realism across varying risk levels.
arXiv Detail & Related papers (2025-05-06T04:41:20Z) - OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z) - RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving [10.984203470464687]
Vision-language models (VLMs) often suffer from limitations such as inadequate spatial perception and hallucination. We propose a retrieval-augmented decision-making (RAD) framework to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. We fine-tune VLMs on a dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities.
arXiv Detail & Related papers (2025-03-18T03:25:57Z) - INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation [7.362380225654904]
INSIGHT is a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models.
arXiv Detail & Related papers (2025-02-01T01:43:53Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction [69.29802752614677]
RouteFormer is a novel ego-trajectory prediction network combining GPS data, environmental context, and the driver's field-of-view. To tackle data scarcity and enhance diversity, we introduce GEM, a dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data.
arXiv Detail & Related papers (2023-12-13T23:06:30Z) - Learning Terrain-Aware Kinodynamic Model for Autonomous Off-Road Rally Driving With Model Predictive Path Integral Control [4.23755398158039]
We propose a method for learning a terrain-aware kinodynamic model conditioned on both proprioceptive and exteroceptive information.
The proposed model generates reliable predictions of 6-degree-of-freedom motion and can even estimate contact interactions.
We demonstrate the effectiveness of our approach through experiments on a simulated off-road track, showing that our proposed model-controller pair outperforms the baseline.
arXiv Detail & Related papers (2023-05-01T06:09:49Z) - Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z) - I Know You Can't See Me: Dynamic Occlusion-Aware Safety Validation of Strategic Planners for Autonomous Vehicles Using Hypergames [12.244501203346566]
We develop a novel multi-agent dynamic occlusion risk measure for assessing situational risk.
We present a white-box, scenario-based, accelerated safety validation framework for assessing the safety of strategic planners in AVs.
arXiv Detail & Related papers (2021-09-20T19:38:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.