Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
- URL: http://arxiv.org/abs/2506.09042v3
- Date: Wed, 18 Jun 2025 17:37:28 GMT
- Title: Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models
- Authors: Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, Huan Ling,
- Abstract summary: We introduce Cosmos-Drive-Dreams, a synthetic data generation pipeline that aims to generate challenging scenarios. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos foundation model for the driving domain. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as Autonomous Vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in both training and testing of an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain that is capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps mitigate long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. We open-source our pipeline toolkit, dataset, and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams
Related papers
- SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model [1.3700170633913733]
This paper proposes a simulator-conditioned scene generation engine based on a world model. By constructing a simulation system consistent with real-world scenes, simulation data and labels for arbitrary scenes can be collected and used as conditions for data generation in the world model. Results show that the generated images significantly improve the performance of downstream perception models.
arXiv Detail & Related papers (2025-03-18T06:41:02Z) - Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene [56.73568220959019]
Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for its development is non-trivial. We introduce a novel surrogate to the rescue: generating realistic perception data from different viewpoints in a driving scene. We present the very first solution, using a combination of simulated collaborative data and real ego-car data.
arXiv Detail & Related papers (2025-02-10T17:07:53Z) - DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
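The FID and FVD scores reported above are both instances of the Fréchet distance between Gaussian fits to feature statistics of real and generated data (Inception features for FID, I3D video features for FVD); lower is better. As a minimal sketch of the underlying metric (the function name and toy statistics here are illustrative, not from the paper):

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2}).
    FID/FVD apply this to the mean and covariance of deep features
    extracted from real vs. generated images or videos.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product; discard tiny
    # imaginary components arising from numerical error.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))


# Toy check: identical statistics give distance 0; a mean shift of 1 in
# each of 4 dimensions with identity covariances gives distance 4.
mu, sigma = np.zeros(4), np.eye(4)
same = frechet_distance(mu, sigma, mu, sigma)
shifted = frechet_distance(mu, sigma, np.ones(4), np.eye(4))
```

In practice the statistics are estimated from thousands of feature vectors, so reported FID/FVD values also depend on the feature extractor and sample count used.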
arXiv Detail & Related papers (2024-09-09T09:43:17Z) - GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model [6.144680854063938]
GenDDS is a novel approach for generating driving scenarios for autonomous driving systems.
We employ the KITTI dataset, which includes real-world driving videos, to train the model.
We demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios.
arXiv Detail & Related papers (2024-08-28T15:37:44Z) - SimGen: Simulator-conditioned Driving Scene Generation [50.03358485083602]
We introduce a simulator-conditioned scene generation framework called SimGen. SimGen learns to generate diverse driving scenes by mixing data from the simulator and the real world. It achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator.
arXiv Detail & Related papers (2024-06-13T17:58:32Z) - DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [76.24483706445298]
We introduce DriveDreamer, a world model entirely derived from real-world driving scenarios.
In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states.
DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.
arXiv Detail & Related papers (2023-09-18T13:58:42Z) - IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes [79.18349050238413]
Preparation and training of deployable deep learning architectures require the models to be suited to different traffic scenarios.
An unstructured and complex driving layout found in several developing countries such as India poses a challenge to these models.
We build a new dataset, IDD-3D, which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames.
arXiv Detail & Related papers (2022-10-23T23:03:17Z) - DL-Traff: Survey and Benchmark of Deep Learning Models for Urban Traffic Prediction [7.476566278759198]
By leveraging state-of-the-art deep learning technologies on such data, urban traffic prediction has drawn a lot of attention in the AI and Intelligent Transportation Systems communities.
According to the specific modeling strategy, the state-of-the-art deep learning models can be divided into three categories: grid-based, graph-based, and time-series models.
arXiv Detail & Related papers (2021-08-20T10:08:26Z) - One Million Scenes for Autonomous Driving: ONCE Dataset [91.94189514073354]
We introduce the ONCE dataset for 3D object detection in the autonomous driving scenario.
The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving dataset available.
We reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset.
arXiv Detail & Related papers (2021-06-21T12:28:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.