More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models
- URL: http://arxiv.org/abs/2510.04532v1
- Date: Mon, 06 Oct 2025 06:50:16 GMT
- Title: More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models
- Authors: Xurui Song, Shuo Huai, JingJing Jiang, Jiayi Kong, Jun Luo,
- Abstract summary: We build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT)<n>Our results, unfortunately, indicate a consistent causal disconnect in reasoning-planning.<n>We propose the Reasoning-Planning Decoupling Hypothesis, positing that the training-yielded reasoning is an ancillary byproduct rather than a causal mediator.
- Score: 14.964195894958133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting trajectory planning. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data generation process converts sensors and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan's metrics. Our results, unfortunately, indicate a consistent causal disconnect in reasoning-planning: removing ego/navigation priors causes large drops in planning scores, whereas removing CoT produces only minor changes. Attention analysis further shows that planning primarily focuses on priors rather than the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the training-yielded reasoning is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent's reliance on priors by evaluating its planning robustness against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool to evaluate the causal fidelity of future models.
Related papers
- PlanTRansformer: Unified Prediction and Planning with Goal-conditioned Transformer [0.0]
Trajectory prediction and planning are fundamental yet disconnected components in autonomous driving.<n>Plan TRansformer (PTR) integrates goal-conditioned prediction, dynamic feasibility, interaction awareness, and lane-level topology reasoning.<n>PTR achieves 4.3%/3.5% improvement in marginal/joint mAP compared to the baseline Motion Transformer (MTR) and 15.5% planning error reduction at 5s horizon compared to GameFormer.
arXiv Detail & Related papers (2026-02-03T10:55:05Z) - ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework.<n>It reframes the learning task to predict the residual deviation from an inertial reference.<n>On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z) - Language Model Planning from an Information Theoretic Perspective [31.323156960716826]
decoder-only language models (LMs) organize intermediate computations to support coherent long-range generation.<n>Planning involves structuring computations over long horizons, considering multiple possible continuations, and selectively reusing past information.<n>We study planning in LMs across synthetic grammar, path-finding tasks, and natural language datasets.
arXiv Detail & Related papers (2025-09-28T01:58:15Z) - ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving.<n>We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers.<n>We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z) - Beyond Patterns: Harnessing Causal Logic for Autonomous Driving Trajectory Prediction [10.21659221112514]
We introduce a novel trajectory prediction framework that leverages causal inference to enhance predictive robustness, generalization, and accuracy.<n>Our findings highlight the potential of causal reasoning to transform trajectory prediction, paving the way for robust autonomous driving systems.
arXiv Detail & Related papers (2025-05-11T05:56:07Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Making Large Language Models Better Planners with Reasoning-Decision Alignment [70.5381163219608]
We motivate an end-to-end decision-making model based on multimodality-augmented LLM.
We propose a reasoning-decision alignment constraint between the paired CoTs and planning results.
We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver.
arXiv Detail & Related papers (2024-08-25T16:43:47Z) - Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? [84.17711168595311]
End-to-end autonomous driving has emerged as a promising research direction to target autonomy from a full-stack perspective.
nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models.
We introduce a new metric to evaluate whether the predicted trajectories adhere to the road.
arXiv Detail & Related papers (2023-12-05T11:32:31Z) - Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in
nuScenes [38.43491956142818]
Planning task involves predicting the trajectory of the ego vehicle based on inputs from both internal intention and the external environment.
Most existing works evaluate their performance on the nuScenes dataset using the L2 error and collision rate between the predicted trajectories and the ground truth.
In this paper, we reevaluate these existing evaluation metrics and explore whether they accurately measure the superiority of different methods.
Our simple method achieves similar end-to-end planning performance on the nuScenes dataset with other perception-based methods, reducing the average L2 error by about 20%.
arXiv Detail & Related papers (2023-05-17T17:59:11Z) - Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.