Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
- URL: http://arxiv.org/abs/2510.19195v2
- Date: Fri, 24 Oct 2025 10:10:43 GMT
- Title: Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
- Authors: Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang
- Abstract summary: We introduce Dream4Drive, a novel synthetic data generation framework for enhancing downstream perception tasks. Dream4Drive decomposes the input video into several 3D-aware guidance maps and renders 3D assets onto these guidance maps. The driving world model is then fine-tuned to produce edited, multi-view videos, which can be used to train downstream perception models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are $\mathbf{really\ crucial}$ for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and then finetunes on real data, resulting in twice the epochs of the baseline (real data only). When we double the epochs of the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed to enhance downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner-case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm-research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm-research/Dream4Drive
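The abstract describes both a three-step generation pipeline (decompose into 3D-aware guidance maps, render 3D assets onto them, fine-tune the world model) and an epoch-matched evaluation protocol. The Python sketch below illustrates how the two might fit together; every name in it (decompose_to_guidance_maps, render_asset_onto_maps, world_model.generate, the .fit interface) is a hypothetical placeholder, not the authors' released API.

```python
# A minimal sketch, assuming hypothetical names throughout; the paper's
# actual code (https://github.com/wm-research/Dream4Drive) may differ.

from dataclasses import dataclass


@dataclass
class GuidanceMaps:
    """3D-aware guidance maps decomposed from a multi-view input clip."""
    depth: object    # per-view depth maps
    layout: object   # per-view semantic / layout maps
    boxes: object    # projected 3D bounding boxes


def decompose_to_guidance_maps(video) -> GuidanceMaps:
    """Step 1: decompose the input video into 3D-aware guidance maps."""
    ...


def render_asset_onto_maps(maps: GuidanceMaps, asset) -> GuidanceMaps:
    """Step 2: render a 3D asset (e.g. from DriveObj3D) onto the guidance
    maps, for instance inserting a rare object to create a corner case."""
    ...


def edit_clip(world_model, video, asset):
    """Step 3: the fine-tuned world model produces edited, multi-view
    photorealistic frames conditioned on the edited guidance maps."""
    maps = decompose_to_guidance_maps(video)
    maps = render_asset_onto_maps(maps, asset)
    return world_model.generate(maps)


def epoch_matched_comparison(make_model, real_data, synthetic_data, epochs):
    """Epoch-matched protocol implied by the abstract: pretraining on
    synthetic data and finetuning on real data consumes 2 * epochs of
    training, so the fair real-only baseline also gets 2 * epochs."""
    baseline = make_model()
    baseline.fit(real_data, epochs=2 * epochs)    # real only, matched budget

    candidate = make_model()
    candidate.fit(synthetic_data, epochs=epochs)  # pretrain on synthetic
    candidate.fit(real_data, epochs=epochs)       # finetune on real
    return baseline, candidate
```

Under this protocol, the comparison is between `baseline` and `candidate` at equal total training budget, which is the condition under which the abstract reports that naive synthetic pretraining stops helping.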
Related papers
- GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
We propose GenieDrive, a framework for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.
arXiv Detail & Related papers (2025-12-14T16:23:51Z)
- Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
We introduce Cosmos-Drive-Dreams, a synthetic data generation pipeline that aims to generate challenging scenarios. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos foundation model for the driving domain. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving data with high-fidelity and challenging scenarios.
arXiv Detail & Related papers (2025-06-10T17:58:17Z)
- Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. We introduce a novel surrogate to the rescue: generating realistic perception data from different viewpoints in a driving scene. We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z)
- DreamDrive: Generative 4D Scene Modeling from Street View Images
We present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references. We then render 3D-consistent driving videos via Gaussian splatting.
arXiv Detail & Related papers (2024-12-31T18:59:57Z)
- DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
We introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors.
To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios.
arXiv Detail & Related papers (2024-10-17T14:07:46Z)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
MagicDrive3D is a novel framework for controllable 3D street scene generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. It generates diverse, high-quality 3D driving scenes, supports any-view rendering, and enhances downstream tasks like BEV segmentation.
arXiv Detail & Related papers (2024-05-23T12:04:51Z)
- DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
We propose DriveDreamer-2, which builds upon the framework of DriveDreamer to generate user-defined driving videos.
Ultimately, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos.
arXiv Detail & Related papers (2024-03-11T16:03:35Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)