Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
- URL: http://arxiv.org/abs/2503.14492v2
- Date: Tue, 01 Apr 2025 21:14:20 GMT
- Title: Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
- Authors: NVIDIA, Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Xinglong Sun, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng,
- Abstract summary: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs.
We conduct evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment.
- Score: 97.98560001760126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
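The adaptive spatial conditioning scheme described in the abstract, weighting different control modalities differently at different spatial locations, can be sketched as a per-pixel convex combination of control branch outputs. The function name and the normalization scheme below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def fuse_control_branches(branch_outputs, weight_maps):
    """Spatially adaptive fusion of conditional control signals.

    branch_outputs: dict mapping modality name (e.g. "depth", "edge",
        "segmentation") to a feature array of shape (H, W, C).
    weight_maps: dict mapping the same modality names to per-pixel
        weights of shape (H, W). Weights are normalized at each
        location so the modalities form a convex combination,
        letting e.g. depth dominate in one region and edges in another.
    """
    names = list(branch_outputs)
    # Stack weights to (num_modalities, H, W) and normalize per pixel.
    w = np.stack([weight_maps[n] for n in names], axis=0)
    w = w / np.clip(w.sum(axis=0, keepdims=True), 1e-8, None)
    # Weighted sum of branch features at every spatial location.
    fused = sum(w[i][..., None] * branch_outputs[n]
                for i, n in enumerate(names))
    return fused
```

In the actual model the weighted features would condition a diffusion backbone; this sketch only illustrates how per-location weight maps make the conditioning customizable across the frame.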
Related papers
- GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving [16.101356013671833]
GAIA-2 is a latent diffusion world model that unifies capabilities within a single generative framework.
GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs.
It generates high-resolution, temporally consistent multi-camera videos across geographically diverse driving environments.
arXiv Detail & Related papers (2025-03-26T13:11:35Z) - SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model [1.3700170633913733]
This paper proposes a simulator-conditioned scene generation engine based on a world model.
By constructing a simulation system consistent with real-world scenes, simulation data and labels, which serve as the conditions for data generation in the world model, can be collected for any scene.
Results show that these generated images significantly improve the performance of downstream perception models.
arXiv Detail & Related papers (2025-03-18T06:41:02Z) - Trajectory World Models for Heterogeneous Environments [67.27233466954814]
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models.
We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity.
We propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context.
arXiv Detail & Related papers (2025-02-03T13:59:08Z) - OmniRe: Omni Urban Scene Reconstruction [78.99262488964423]
We introduce OmniRe, a comprehensive system for creating high-fidelity digital twins of dynamic real-world scenes from on-device logs.
Our approach builds scene graphs on 3DGS and constructs multiple Gaussian representations in canonical spaces that model various dynamic actors.
arXiv Detail & Related papers (2024-08-29T17:56:33Z) - Domain Adaptive Graph Neural Networks for Constraining Cosmological Parameters Across Multiple Data Sets [40.19690479537335]
We show that DA-GNN achieves higher accuracy and robustness on cross-dataset tasks.
This shows that DA-GNNs are a promising method for extracting domain-independent cosmological information.
arXiv Detail & Related papers (2023-11-02T20:40:21Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an end-to-end, open-set (any environment/scene) autonomous driving approach capable of producing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction [149.5716746789134]
We show data-driven traffic simulation can be formulated as a world model.
We present TrafficBots, a multi-agent policy built upon motion prediction and end-to-end driving.
Experiments on the open motion dataset show TrafficBots can simulate realistic multi-agent behaviors.
arXiv Detail & Related papers (2023-03-07T18:28:41Z) - Deep Generative Framework for Interactive 3D Terrain Authoring and Manipulation [4.202216894379241]
We propose a novel realistic terrain authoring framework powered by a combination of a VAE and a conditional GAN.
Our framework is an example-based method that attempts to overcome the limitations of existing methods by learning a latent space from a real-world terrain dataset.
We also developed an interactive tool that lets the user generate diverse terrains with minimal inputs.
arXiv Detail & Related papers (2022-01-07T08:58:01Z) - Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z) - Cycle-Consistent World Models for Domain Independent Latent Imagination [0.0]
High costs and risks make it hard to train autonomous cars in the real world.
We propose a novel model-based reinforcement learning approach called Cycle-Consistent World Models.
arXiv Detail & Related papers (2021-10-02T13:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.