Latent Chain-of-Thought World Modeling for End-to-End Driving
- URL: http://arxiv.org/abs/2512.10226v1
- Date: Thu, 11 Dec 2025 02:22:07 GMT
- Title: Latent Chain-of-Thought World Modeling for End-to-End Driving
- Authors: Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic
- Abstract summary: We present Latent-CoT-Drive (LCDrive), a model that expresses CoT in a latent language. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning.
- Score: 45.726304769312414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
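The abstract specifies the decoding pattern (alternating action-proposal tokens and world-model tokens before emitting a final action) but no implementation. The following is a minimal sketch of what such an interleaved decoding loop could look like; the GRU backbone, vocabulary sizes, and every name here are illustrative stand-ins, not LCDrive's architecture.

```python
# Hypothetical sketch of LCDrive-style interleaved latent decoding.
# Names and dimensions are illustrative, not from the paper's code.
import torch
import torch.nn as nn

ACTION_VOCAB = 256   # shared vocabulary for proposals and final actions (assumed)
WORLD_VOCAB = 512    # discrete codes of a learned latent world model (assumed)
D_MODEL = 128

class InterleavedLatentCoT(nn.Module):
    def __init__(self, n_rounds: int = 3):
        super().__init__()
        self.n_rounds = n_rounds
        self.backbone = nn.GRU(D_MODEL, D_MODEL, batch_first=True)
        self.action_embed = nn.Embedding(ACTION_VOCAB, D_MODEL)
        self.world_embed = nn.Embedding(WORLD_VOCAB, D_MODEL)
        self.action_head = nn.Linear(D_MODEL, ACTION_VOCAB)  # proposals and output share this head
        self.world_head = nn.Linear(D_MODEL, WORLD_VOCAB)    # grounded in the latent world model

    @torch.no_grad()
    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        """scene_feats: (B, T, D_MODEL) encoded camera/map context."""
        out, h = self.backbone(scene_feats)
        step = out[:, -1:, :]
        for _ in range(self.n_rounds):
            # (1) propose an action token from the *same* vocabulary as outputs
            a = self.action_head(step).argmax(-1)
            step, h = self.backbone(self.action_embed(a), h)
            # (2) emit a world-model token describing the proposal's predicted outcome
            w = self.world_head(step).argmax(-1)
            step, h = self.backbone(self.world_embed(w), h)
        # decode the final action after reasoning over proposals and outcomes
        return self.action_head(step).argmax(-1)

model = InterleavedLatentCoT()
final_action = model(torch.randn(2, 10, D_MODEL))  # batch of 2 scenes
print(final_action.shape)  # torch.Size([2, 1])
```

In the real model, the world-model head would be grounded by supervision from ground-truth future scene rollouts during cold start and sharpened by closed-loop reinforcement learning, as the abstract describes.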
Related papers
- Unifying Language-Action Understanding and Generation for Autonomous Driving [25.23561391638388]
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving.
Existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation.
We introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency.
arXiv Detail & Related papers (2026-03-02T04:41:10Z)
- Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion [23.834662472392694]
Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD) is a novel framework designed to bridge the gap between efficient planning and semantic explainability.
We introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions.
Experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision.
arXiv Detail & Related papers (2026-02-24T05:59:10Z)
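The MVLAD-AD summary above mentions a compact codebook of kinematically feasible waypoints but not how it is built. One common way to construct such a codebook is clustering logged trajectories; the sketch below does this with k-means on synthetic data, and both the clustering choice and the feasibility filter are assumptions rather than the paper's method.

```python
# Hypothetical sketch: building a discrete waypoint codebook from driving logs.
# k-means and the curvature/speed filter are assumptions; the abstract only
# states that a compact codebook of kinematically feasible waypoints is built.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for logged future waypoints: (N, 2) displacements in ego frame (meters).
waypoints = rng.normal(loc=[8.0, 0.0], scale=[4.0, 2.0], size=(10_000, 2))

# Keep only kinematically plausible displacements (simple heuristic filter).
speed_ok = np.linalg.norm(waypoints, axis=1) < 25.0  # reachable in one step
forward_ok = waypoints[:, 0] > -1.0                  # no hard reversing
feasible = waypoints[speed_ok & forward_ok]

# Cluster into a compact codebook; each centroid becomes one action token.
codebook = KMeans(n_clusters=128, n_init=10, random_state=0).fit(feasible)

def tokenize(wp: np.ndarray) -> int:
    """Map a continuous waypoint to its nearest codebook entry (action token)."""
    return int(codebook.predict(wp.reshape(1, -1))[0])

def detokenize(token: int) -> np.ndarray:
    """Recover the representative waypoint for an action token."""
    return codebook.cluster_centers_[token]

t = tokenize(np.array([6.0, 1.5]))
print(t, detokenize(t))
```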
- Generative Scenario Rollouts for End-to-End Autonomous Driving [58.99809446189301]
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems.
We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes.
arXiv Detail & Related papers (2026-01-16T17:59:28Z)
- MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning [51.20229133553804]
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL).
Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning.
We propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters.
By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions.
arXiv Detail & Related papers (2025-12-15T18:31:32Z)
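The MindDrive summary above states that one LLM carries two distinct sets of LoRA parameters, without saying how they are organized. A minimal plain-PyTorch way to keep two switchable low-rank adapter sets on a shared frozen weight is sketched below; the "reasoning"/"action" role split is a guess, not the paper's design.

```python
# Hypothetical sketch: one frozen linear layer with two switchable LoRA sets,
# e.g. one for reasoning and one for action decoding (role split is assumed;
# the abstract only states the LLM carries two distinct sets of LoRA parameters).
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # shared backbone stays frozen
        self.base.bias.requires_grad_(False)
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_in, rank, bias=False),
                                nn.Linear(rank, d_out, bias=False))
            for name in ("reasoning", "action")
        })
        for seq in self.adapters.values():
            nn.init.zeros_(seq[1].weight)        # start as identity w.r.t. base
        self.active = "reasoning"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.adapters[self.active](x)

layer = DualLoRALinear(64, 64)
x = torch.randn(2, 64)
layer.active = "reasoning"   # latent/linguistic reasoning pass (assumed role)
y_reason = layer(x)
layer.active = "action"      # action decoding pass (assumed role)
y_action = layer(x)
print(torch.allclose(y_reason, y_action))  # True at init: both adapters are zero
```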
- CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving [10.836513600206118]
We propose Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs).
CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning.
Experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations.
arXiv Detail & Related papers (2025-11-27T15:13:13Z)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge.
We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
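The ImagiDrive summary above pairs a VLM-based driving agent with a DWM-based scene imaginer but gives no interface details. A generic propose-imagine-score loop of the kind such a pairing suggests is sketched below, with stub functions standing in for both components; none of this API comes from the paper.

```python
# Hypothetical sketch of an imagination-and-planning loop: a driving agent
# proposes trajectories, a world model imagines their outcomes, and the best
# candidate is executed. All components are stubs; the abstract defines no API.
from dataclasses import dataclass
import random

@dataclass
class Trajectory:
    waypoints: list  # [(x, y), ...] in ego frame

def vlm_propose(scene: str, k: int = 4) -> list:
    """Stub for the VLM-based driving agent: propose k candidate trajectories."""
    return [Trajectory([(i * random.uniform(1.0, 3.0), random.uniform(-1, 1))
                        for i in range(1, 5)]) for _ in range(k)]

def dwm_imagine(scene: str, traj: Trajectory) -> float:
    """Stub for the DWM scene imaginer: roll the scene forward under traj and
    return a scalar score (here: forward progress minus lateral drift)."""
    progress = traj.waypoints[-1][0]
    drift = sum(abs(y) for _, y in traj.waypoints)
    return progress - 0.5 * drift

def plan(scene: str) -> Trajectory:
    candidates = vlm_propose(scene)
    # Imagine each candidate's future and keep the highest-scoring one.
    return max(candidates, key=lambda t: dwm_imagine(scene, t))

best = plan("four-way intersection, pedestrian on the right")
print(best.waypoints)
```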
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning [37.176428069948535]
Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving.
Current VLA models struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning.
We propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model.
arXiv Detail & Related papers (2025-06-16T17:58:50Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving.
We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers.
We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning.
Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z)
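DrivingGPT's summary above describes a multimodal driving language of interleaved image and action tokens; the serialization itself is not given in the abstract. The sketch below shows one straightforward interleaving scheme over discrete token IDs, with invented vocabulary sizes and offsets.

```python
# Hypothetical sketch: serializing a drive into one interleaved token stream
# of the form [frame_0 tokens][action_0][frame_1 tokens][action_1]...
# Vocabulary sizes and the offset scheme are invented for illustration.
IMAGE_VOCAB = 1024            # codes from a visual tokenizer (e.g. a VQ model)
ACTION_VOCAB = 64             # discretized driving actions
ACTION_OFFSET = IMAGE_VOCAB   # shift actions into a disjoint ID range

def interleave(frames: list, actions: list) -> list:
    """frames: list of per-frame image-token lists; actions: one id per frame."""
    assert len(frames) == len(actions)
    stream = []
    for img_tokens, act in zip(frames, actions):
        stream.extend(img_tokens)            # observation first
        stream.append(ACTION_OFFSET + act)   # then the action taken
    return stream

def split(stream: list, tokens_per_frame: int):
    """Invert interleave() for a fixed number of image tokens per frame."""
    frames, actions = [], []
    step = tokens_per_frame + 1
    for i in range(0, len(stream), step):
        frames.append(stream[i:i + tokens_per_frame])
        actions.append(stream[i + tokens_per_frame] - ACTION_OFFSET)
    return frames, actions

seq = interleave([[1, 2, 3], [4, 5, 6]], [7, 9])
print(seq)            # [1, 2, 3, 1031, 4, 5, 6, 1033]
print(split(seq, 3))  # ([[1, 2, 3], [4, 5, 6]], [7, 9])
```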
- Doe-1: Closed-Loop Autonomous Driving with Large World Model [63.99937807085461]
We propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning.
We use free-form texts for perception and generate future predictions directly in the RGB space with image tokens.
For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens.
arXiv Detail & Related papers (2024-12-12T18:59:59Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
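Multimodal-CoT's two-stage framework separates rationale generation from answer inference. The schematic below illustrates that split with stub models; the actual paper fine-tunes encoder-decoder language models on fused text and vision features at both stages, which this sketch does not attempt to reproduce.

```python
# Schematic of Multimodal-CoT's two-stage framework: stage 1 generates a
# rationale from text + image features, stage 2 infers the answer from the
# rationale-augmented text. Models here are stubs, not the paper's code.
from typing import Callable

def multimodal_cot(question: str,
                   image_feats,
                   rationale_model: Callable,
                   answer_model: Callable) -> str:
    # Stage 1: rationale generation conditioned on both modalities.
    rationale = rationale_model(question, image_feats)
    # Stage 2: answer inference on the rationale-augmented input,
    # again conditioned on the image features.
    return answer_model(f"{question}\nRationale: {rationale}", image_feats)

# Stub models standing in for the fine-tuned vision-language models.
fake_rationale = lambda q, v: "the object is metallic, and metals conduct heat"
fake_answer = lambda q, v: "conductor"

print(multimodal_cot("Is the spoon a conductor or insulator?",
                     image_feats=None,
                     rationale_model=fake_rationale,
                     answer_model=fake_answer))
```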