Related papers: OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

URL: http://arxiv.org/abs/2512.14044v1
Date: Tue, 16 Dec 2025 03:19:28 GMT
Title: OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Authors: Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang,
Abstract summary: We introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving.<n>Our core innovation is an Reinforcement-driven visual grounding capability.<n>This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm.
Score: 11.793091014248025
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning.While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels.Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is an Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

Related papers

DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse.<n>We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories.<n>This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z)
dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.<n> evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity.<n>We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks.<n>Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z)
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail [85.47497935739936]
Alpamayo-R1 (AR1) is a vision-language-action model that integrates Chain of Causation reasoning with trajectory planning.<n>We show AR1 achieves 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline.<n>We plan to release AR1 models and a subset of the CoC in a future update.
arXiv Detail & Related papers (2025-10-30T01:25:34Z)
Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion.<n>Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient.<n>Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving [21.10362636088305]
Chain of Thought (CoT) has been widely adopted in Vision Language Action (VLA) models.<n>We propose AdaThinkDrive, a novel VLA framework with a dual mode reasoning mechanism inspired by fast and slow thinking.
arXiv Detail & Related papers (2025-09-17T07:35:39Z)
AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving [37.260140808367716]
We propose AutoDrive-R$2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems.<n>We first propose an innovative CoT dataset named nuScenesR$2$-6K for supervised fine-tuning.<n>We then employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework to ensure reliable smoothness and realistic trajectory planning.
arXiv Detail & Related papers (2025-09-02T04:32:24Z)
DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking [33.98300989562812]
We introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning.<n>In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence.<n>We propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning.
arXiv Detail & Related papers (2025-07-28T14:33:15Z)
ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving [26.169506297338234]
multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving.<n>We propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning.<n>Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on Bench2Drive benchmark.
arXiv Detail & Related papers (2025-05-26T14:12:38Z)
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models [86.88657425848547]
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning.<n>We explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks.<n>Our three stage-pipeline individual alignment, parameter-space merging, and domain-specific reinforcement learning, boosts performance by over 10% relative to instruction-tuned baselines.
arXiv Detail & Related papers (2025-05-15T17:58:33Z)
SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge.<n>To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic.<n>Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z)
DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving.<n>Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner.<n>Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs. We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps. We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.