UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
- URL: http://arxiv.org/abs/2601.04453v1
- Date: Wed, 07 Jan 2026 23:49:52 GMT
- Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
- Abstract summary: We propose a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation. Experiments on the Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
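The abstract's plan-imagine-refine loop (planner proposes a trajectory, a trajectory-conditioned generator imagines future frames, and the imagined futures feed back to refine the plan) can be sketched as a minimal, self-contained stub. All function names, the refinement count, and the data types here are illustrative assumptions, not UniDrive-WM's actual interfaces:

```python
from dataclasses import dataclass
from typing import Optional

Trajectory = list  # sequence of (x, y) waypoints

@dataclass
class Observation:
    frame: str          # stand-in for camera imagery
    scene_summary: str  # stand-in for VLM scene understanding

def plan(obs: Observation, prior: Optional[Trajectory]) -> Trajectory:
    """Stub planner: extend the prior trajectory, or start a straight one."""
    if prior is None:
        return [(0.0, 0.0), (1.0, 0.0)]
    last_x, last_y = prior[-1]
    return prior + [(last_x + 1.0, last_y)]

def generate_future(obs: Observation, traj: Trajectory) -> Observation:
    """Stub trajectory-conditioned generator: 'imagines' the next frame."""
    return Observation(frame=f"imagined@{traj[-1]}",
                       scene_summary=obs.scene_summary)

def unified_step(obs: Observation, refinement_iters: int = 2):
    """One understanding -> planning -> generation -> refinement cycle."""
    traj = plan(obs, None)
    imagined = generate_future(obs, traj)
    for _ in range(refinement_iters):
        # Imagined futures act as extra supervision that refines the plan.
        traj = plan(imagined, traj)
        imagined = generate_future(obs, traj)
    return traj, imagined

traj, imagined = unified_step(Observation(frame="cam0", scene_summary="clear road"))
print(len(traj), imagined.frame)
```

The point of the sketch is only the data flow: the generated future frames re-enter the planner, which is what "iteratively refine trajectory generation" means in the abstract.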
Related papers
- Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution
We introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning.
arXiv Detail & Related papers (2025-10-13T07:41:47Z)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of autonomous driving. We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
- World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
We present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models. World4Drive achieves state-of-the-art performance without manual perception annotations on open-loop nuScenes and closed-loop NavSim benchmarks.
arXiv Detail & Related papers (2025-07-01T09:36:38Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z)
- Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation. Our method can generate plausible and controllable 4D occupancy, paving the way for advancements in driving world generation and end-to-end planning.
arXiv Detail & Related papers (2024-08-26T11:53:09Z)
- Enhancing End-to-End Autonomous Driving with Latent World Model
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z)
- GenAD: Generative End-to-End Autonomous Driving
GenAD is a generative framework that casts autonomous driving into a generative modeling problem.
We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens.
We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling.
arXiv Detail & Related papers (2024-02-18T08:21:05Z)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
- ADriver-I: A General World Model for Autonomous Driving
We introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals.
Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I.
It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame.
arXiv Detail & Related papers (2023-11-22T17:44:29Z)
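The interleaved vision-action format that ADriver-I describes (visual features and control signals unified into one sequence, with the action for the current frame predicted autoregressively) can be illustrated with a toy token stream. The tokenization and the trivial "predictor" below are hypothetical stand-ins, not the paper's MLLM/diffusion components:

```python
def interleave(frames, actions):
    """Build one token stream: <img_0> <act_0> <img_1> <act_1> ..."""
    seq = []
    for frame, act in zip(frames, actions):
        seq.append(("IMG", frame))
        seq.append(("ACT", act))
    return seq

def predict_next_action(seq):
    """Stub autoregressive head: just repeat the most recent action token.
    A real model would condition on the full interleaved history."""
    last_actions = [tok for kind, tok in seq if kind == "ACT"]
    return last_actions[-1] if last_actions else "noop"

history = interleave(["f0", "f1", "f2"],
                     ["steer_left", "straight", "straight"])
print(predict_next_action(history))
```

The design point the sketch captures is that images and actions share a single sequence, so one autoregressive model can serve as both world model (predict the next frame) and planner (predict the next action).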
This list is automatically generated from the titles and abstracts of the papers in this site.