DreamGen: Unlocking Generalization in Robot Learning through Video World Models
- URL: http://arxiv.org/abs/2505.12705v2
- Date: Tue, 17 Jun 2025 22:33:35 GMT
- Title: DreamGen: Unlocking Generalization in Robot Learning through Video World Models
- Authors: Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, Linxi Fan
- Abstract summary: DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
- Score: 120.25799361925387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T-Dreams.
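The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: frames and actions are plain lists of floats, the "video model" and "IDM" are trivial stand-ins, and every name here (`adapt_video_model`, `inverse_dynamics`, `build_policy_dataset`, `dreamgen_sketch`, the example instructions) is hypothetical. The split into four steps (adapt the video model, generate videos, recover pseudo-actions, assemble policy training data) is one reading of the abstract, and conditioning generation on a text instruction is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[float]   # toy stand-in for one image
Action = List[float]  # toy stand-in for one action vector
VideoModel = Callable[[Frame, str, int], List[Frame]]


@dataclass
class NeuralTrajectory:
    """A synthetic video paired with recovered pseudo-actions."""
    instruction: str
    frames: List[Frame]
    actions: List[Action]


def adapt_video_model(teleop_videos: Sequence[List[Frame]]) -> VideoModel:
    """Step 1 (toy): 'fine-tune' by estimating the mean per-step frame change."""
    deltas = [
        [b - a for a, b in zip(prev, nxt)]
        for video in teleop_videos
        for prev, nxt in zip(video, video[1:])
    ]
    dim = len(deltas[0])
    step = [sum(d[i] for d in deltas) / len(deltas) for i in range(dim)]

    def generate(first_frame: Frame, instruction: str, num_frames: int) -> List[Frame]:
        # Step 2 (toy): roll the first frame forward. A real image-to-video
        # model would use the instruction; this stand-in ignores it.
        return [[x + t * s for x, s in zip(first_frame, step)]
                for t in range(num_frames)]

    return generate


def inverse_dynamics(frames: List[Frame]) -> List[Action]:
    """Step 3 (toy IDM): label each transition with the frame difference."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


def build_policy_dataset(
    trajectories: Sequence[NeuralTrajectory],
) -> List[Tuple[Frame, str, Action]]:
    """Step 4 (toy): flatten into (observation, instruction, action) samples."""
    return [(f, t.instruction, a)
            for t in trajectories
            for f, a in zip(t.frames, t.actions)]


def dreamgen_sketch(
    teleop_videos: Sequence[List[Frame]],
    prompts: Sequence[Tuple[Frame, str]],
    num_frames: int = 4,
) -> List[Tuple[Frame, str, Action]]:
    generate = adapt_video_model(teleop_videos)
    trajectories = []
    for first_frame, instruction in prompts:
        frames = generate(first_frame, instruction, num_frames)
        trajectories.append(
            NeuralTrajectory(instruction, frames, inverse_dynamics(frames))
        )
    return build_policy_dataset(trajectories)


if __name__ == "__main__":
    teleop = [[[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]]  # one toy teleop video
    prompts = [([0.0, 0.0], "pour water"), ([5.0, 5.0], "open drawer")]
    samples = dreamgen_sketch(teleop, prompts)
    print(len(samples), samples[0])
```

In a real system each toy function would be replaced by a learned model; the point of the sketch is only that the policy never sees teleoperated actions for the new prompts, just pseudo-actions recovered from generated video.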
Related papers
- ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [133.23509142762356]
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy. We introduce GR00T N1, an open foundation model for humanoid robots.
arXiv Detail & Related papers (2025-03-18T21:06:21Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression [23.99292102237088]
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics. After post-training, this model can be used as a video simulator for evaluating policies and generating synthetic data.
arXiv Detail & Related papers (2025-02-06T18:38:26Z)
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.
Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
arXiv Detail & Related papers (2024-09-24T17:57:33Z)
- Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments [26.66666135624716]
We present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies.
RUMs can generalize to new environments without any finetuning.
We train five utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects.
arXiv Detail & Related papers (2024-09-09T17:59:50Z)
- RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [68.70755196744533]
RoboGen is a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation.
Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics.
arXiv Detail & Related papers (2023-11-02T17:59:21Z)
- MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations [55.549956643032836]
MimicGen is a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations.
We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks.
arXiv Detail & Related papers (2023-10-26T17:17:31Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.