UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
- URL: http://arxiv.org/abs/2602.02002v1
- Date: Mon, 02 Feb 2026 12:02:27 GMT
- Title: UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
- Authors: Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu, Guan Huang, Yongchen Zai, Ji Jiao, Changliang Xue, Xiaole Wang, Zhen Yang, Futang Zhu, Xingang Wang
- Abstract summary: UniDriveDreamer is a single-stage unified multimodal world model for autonomous driving. It generates multimodal future observations without relying on intermediate representations or cascaded modules. It outperforms previous state-of-the-art methods in both video and LiDAR generation.
- Score: 34.278528623978204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream tasks.
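To make the single-stage pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the flow: a LiDAR VAE and a video VAE each encode their modality, a ULA-style term aligns the two latent distributions, the latents are fused, and a diffusion-transformer backbone processes the fused tokens under a layout conditioning signal. All module names (ModalityVAE, ula_alignment_loss, DiffusionTransformerStub), the moment-matching form of the alignment loss, the additive conditioning, and every shape are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a single-stage multimodal world-model forward pass.
# Module names, shapes, and the alignment objective are assumptions for
# illustration; the paper's actual formulation may differ.
import torch
import torch.nn as nn

class ModalityVAE(nn.Module):
    """Stand-in encoder producing a Gaussian latent (mu, logvar) per token."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

def ula_alignment_loss(mu_v, logvar_v, mu_l, logvar_l):
    """Assumed ULA-style objective: symmetrically pull the per-modality
    latent Gaussians toward each other so fusion is well-conditioned."""
    return ((mu_v - mu_l) ** 2).mean() + ((logvar_v - logvar_l) ** 2).mean()

class DiffusionTransformerStub(nn.Module):
    """Placeholder DiT backbone operating on fused video+LiDAR tokens."""
    def __init__(self, latent_dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, fused_tokens, layout_cond):
        # Layout conditioning is injected additively here for simplicity;
        # per-modality projection of scene layout is abstracted away.
        return self.backbone(fused_tokens + layout_cond)

# Toy forward pass: batch of 2, 16 tokens per modality, 64-dim latents.
video_vae, lidar_vae = ModalityVAE(128, 64), ModalityVAE(96, 64)
z_v, mu_v, lv_v = video_vae(torch.randn(2, 16, 128))
z_l, mu_l, lv_l = lidar_vae(torch.randn(2, 16, 96))
align = ula_alignment_loss(mu_v, lv_v, mu_l, lv_l)   # ULA term
fused = torch.cat([z_v, z_l], dim=1)                 # token-wise fusion
out = DiffusionTransformerStub(64)(fused, torch.zeros_like(fused))
print(out.shape, align.item())
```

One plausible reading of why alignment precedes fusion: if the two VAEs produce latents with mismatched scales or offsets, the concatenated token sequence fed to the diffusion transformer is poorly conditioned, which would explain the abstract's emphasis on "cross-modal compatibility and training stability."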
Related papers
- OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving [58.693329943871355]
We propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction.
arXiv Detail & Related papers (2025-12-16T09:18:15Z) - DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse. We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z) - WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving [9.719456684859606]
WAM-Diff is a framework that employs masked diffusion to refine a discrete sequence representing future ego-trajectories. Our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7S on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving.
arXiv Detail & Related papers (2025-12-06T10:51:53Z) - DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking [10.270441242480482]
This paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation.
arXiv Detail & Related papers (2025-11-28T06:02:58Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used nuPlan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving [14.486548540613791]
We introduce ViLaD, a novel Large Vision Language Diffusion framework for end-to-end autonomous driving. ViLaD enables parallel generation of entire driving decision sequences, significantly reducing computational latency. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed.
arXiv Detail & Related papers (2025-08-18T04:01:56Z) - TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving [20.679370777762987]
We propose TransDiffuser, an encoder-decoder based generative trajectory planning model. We exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process. TransDiffuser achieves a PDMS of 94.85 on the closed-loop planning-oriented benchmark NAVSIM.
arXiv Detail & Related papers (2025-05-14T12:10:41Z) - DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z) - X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios [105.16073169351299]
We propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images.
Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions.
X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds.
arXiv Detail & Related papers (2024-11-02T03:52:12Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)