CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
- URL: http://arxiv.org/abs/2511.22532v1
- Date: Thu, 27 Nov 2025 15:13:13 GMT
- Title: CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
- Authors: Zhaohui Wang, Tengbo Yu, Hao Tang
- Abstract summary: We propose CoT4AD, a VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. Experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations.
- Score: 10.836513600206118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.
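The abstract describes the perception-question-prediction-action CoT only at a high level, and the code is unreleased. As a rough illustration of how such a step might be serialized into a single training target, consider the minimal Python sketch below; the tag names, dataclass, and waypoint format are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the "perception-question-prediction-action" CoT
# described in the abstract. Tag names, prompt format, and the waypoint
# serialization are assumptions; the authors' code is not yet released.
from dataclasses import dataclass

@dataclass
class CoTSample:
    perception: str   # scene description derived from visual observations
    question: str     # task query posed about the scene
    prediction: str   # forecast of how the scene evolves
    waypoints: list[tuple[float, float]]  # planned (x, y) trajectory points

def build_training_target(s: CoTSample) -> str:
    """Serialize one explicit CoT step so the reasoning text and the
    numeric trajectory share a single autoregressive training target,
    aligning the reasoning space with the action space."""
    traj = "; ".join(f"({x:.2f}, {y:.2f})" for x, y in s.waypoints)
    return (
        f"<perception> {s.perception}\n"
        f"<question> {s.question}\n"
        f"<prediction> {s.prediction}\n"
        f"<action> {traj}"
    )

sample = CoTSample(
    perception="A pedestrian is crossing 12 m ahead in the ego lane.",
    question="Should the ego vehicle yield?",
    prediction="The pedestrian reaches the lane center in about 2 s.",
    waypoints=[(0.0, 3.2), (0.0, 5.1), (0.0, 6.0)],
)
print(build_training_target(sample))
```

Per the abstract, the CoT becomes implicit at inference time, i.e., the model would emit only the action span while the intermediate steps stay internal; the serialization above illustrates only the training-time alignment.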
Related papers
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation [11.18316873483782]
Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context. Recent works demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. We propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead.
arXiv Detail & Related papers (2026-01-20T13:54:10Z)
- MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning [51.20229133553804]
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL). Online Reinforcement Learning offers a promising pathway to address the limitations of IL through trial-and-error learning. We propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions.
arXiv Detail & Related papers (2025-12-15T18:31:32Z)
- Latent Chain-of-Thought World Modeling for End-to-End Driving [45.726304769312414]
We present Latent-CoT-Drive (LCDrive), a model that expresses CoT in a latent language. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning.
arXiv Detail & Related papers (2025-12-11T02:22:07Z)
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of end-to-end autonomous driving. We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning [37.176428069948535]
Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving. Current VLA models struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. We propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model.
arXiv Detail & Related papers (2025-06-16T17:58:50Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving [22.293019898794963]
We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. Our analysis highlights the impact of vision encoder design and reward-guided reasoning compression.
arXiv Detail & Related papers (2025-05-27T03:21:04Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Towards Interactive and Learnable Cooperative Driving Automation: a Large Language Model-Driven Decision-Making Framework [87.7482313774741]
Connected Autonomous Vehicles (CAVs) have begun open-road testing around the world, but their safety and efficiency performance in complex scenarios is still not satisfactory. This paper proposes CoDrivingLLM, an interactive and learnable LLM-driven cooperative driving framework.
arXiv Detail & Related papers (2024-09-19T14:36:00Z)
- Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs. We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps. We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z)
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z)