UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
- URL: http://arxiv.org/abs/2602.02002v1
- Date: Mon, 02 Feb 2026 12:02:27 GMT
- Title: UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving
- Authors: Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu, Guan Huang, Yongchen Zai, Ji Jiao, Changliang Xue, Xiaole Wang, Zhen Yang, Futang Zhu, Xingang Wang
- Abstract summary: UniDriveDreamer is a single-stage unified multimodal world model for autonomous driving. It generates multimodal future observations without relying on intermediate representations or cascaded modules. It outperforms previous state-of-the-art methods in both video and LiDAR generation.
- Score: 34.278528623978204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream tasks.
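To make the single-stage pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the flow: a LiDAR VAE and a video VAE each encode their modality, a ULA-style term aligns the two latent distributions, the latents are fused, and a diffusion-transformer backbone processes the fused tokens under a layout conditioning signal. All module names (ModalityVAE, ula_alignment_loss, DiffusionTransformerStub), the moment-matching form of the alignment loss, the additive conditioning, and every shape are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a single-stage multimodal world-model forward pass.
# Module names, shapes, and the alignment objective are assumptions for
# illustration; the paper's actual formulation may differ.
import torch
import torch.nn as nn

class ModalityVAE(nn.Module):
    """Stand-in encoder producing a Gaussian latent (mu, logvar) per token."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

def ula_alignment_loss(mu_v, logvar_v, mu_l, logvar_l):
    """Assumed ULA-style objective: symmetrically pull the per-modality
    latent Gaussians toward each other so fusion is well-conditioned."""
    return ((mu_v - mu_l) ** 2).mean() + ((logvar_v - logvar_l) ** 2).mean()

class DiffusionTransformerStub(nn.Module):
    """Placeholder DiT backbone operating on fused video+LiDAR tokens."""
    def __init__(self, latent_dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, fused_tokens, layout_cond):
        # Layout conditioning is injected additively here for simplicity;
        # per-modality projection of scene layout is abstracted away.
        return self.backbone(fused_tokens + layout_cond)

# Toy forward pass: batch of 2, 16 tokens per modality, 64-dim latents.
video_vae, lidar_vae = ModalityVAE(128, 64), ModalityVAE(96, 64)
z_v, mu_v, lv_v = video_vae(torch.randn(2, 16, 128))
z_l, mu_l, lv_l = lidar_vae(torch.randn(2, 16, 96))
align = ula_alignment_loss(mu_v, lv_v, mu_l, lv_l)   # ULA term
fused = torch.cat([z_v, z_l], dim=1)                 # token-wise fusion
out = DiffusionTransformerStub(64)(fused, torch.zeros_like(fused))
print(out.shape, align.item())
```

One plausible reading of why alignment precedes fusion: if the two VAEs produce latents with mismatched scales or offsets, the concatenated token sequence fed to the diffusion transformer is poorly conditioned, which would explain the abstract's emphasis on "cross-modal compatibility and training stability."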
Related papers
- OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving [58.693329943871355]
We propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction.
arXiv Detail & Related papers (2025-12-16T09:18:15Z) - DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse. We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z) - WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving [9.719456684859606]
WAM-Diff is a framework that employs masked diffusion to refine a discrete sequence representing future ego-trajectories. Our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7S on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving.
arXiv Detail & Related papers (2025-12-06T10:51:53Z) - DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking [10.270441242480482]
This paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation.
arXiv Detail & Related papers (2025-11-28T06:02:58Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used nuPlan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving [14.486548540613791]
We introduce ViLaD, a novel Large Vision Language Diffusion framework for end-to-end autonomous driving. ViLaD enables parallel generation of entire driving decision sequences, significantly reducing computational latency. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed.
arXiv Detail & Related papers (2025-08-18T04:01:56Z) - TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving [20.679370777762987]
We propose TransDiffuser, an encoder-decoder based generative trajectory planning model. We exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process. TransDiffuser achieves a PDMS of 94.85 on the closed-loop planning-oriented benchmark NAVSIM.
arXiv Detail & Related papers (2025-05-14T12:10:41Z) - DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z) - X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios [105.16073169351299]
We propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images.
Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions.
X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds.
arXiv Detail & Related papers (2024-11-02T03:52:12Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)