Related papers: DreamGen: Unlocking Generalization in Robot Learning through Video World Models

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

URL: http://arxiv.org/abs/2505.12705v2
Date: Tue, 17 Jun 2025 22:33:35 GMT
Title: DreamGen: Unlocking Generalization in Robot Learning through Video World Models
Authors: Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zhu, Linxi Fan,
Abstract summary: DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories.<n>Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
Score: 120.25799361925387
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T-Dreams.

Related papers

World Action Models are Zero-shot Policies [111.91938055103633]
We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone.<n>By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data.<n>We demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance.
arXiv Detail & Related papers (2026-02-17T15:04:02Z)
Large Video Planner Enables Generalizable Robot Control [117.49024534548319]
General-purpose robots require decision-making models that generalize across diverse tasks and environments.<n>Recent works build robot foundation models by extending multimodal large language models (LMs) with action outputs, creating vision--action (VLA) systems.<n>We explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models.
arXiv Detail & Related papers (2025-12-17T18:35:54Z)
AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis [33.90053396451562]
We introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis.<n>Our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling.<n>Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies.
arXiv Detail & Related papers (2025-12-12T18:59:45Z)
From Generated Human Videos to Physically Plausible Robot Trajectories [103.28274349461607]
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts.<n>To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner?<n>This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video.<n>We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards.
arXiv Detail & Related papers (2025-12-04T18:56:03Z)
ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive.<n>We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation.<n>By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z)
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [133.23509142762356]
General-purpose robots need a versatile body and an intelligent mind.<n>Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy.<n>We introduce GR00T N1, an open foundation model for humanoid robots.
arXiv Detail & Related papers (2025-03-18T21:06:21Z)
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos.<n> VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression [23.99292102237088]
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics.<n>After post-training, this model can be used as a video simulator for evaluating policies and generating synthetic data.
arXiv Detail & Related papers (2025-02-06T18:38:26Z)
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
arXiv Detail & Related papers (2024-09-24T17:57:33Z)
Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments [26.66666135624716]
We present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies. RUMs can generalize to new environments without any finetuning. We train five utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects.
arXiv Detail & Related papers (2024-09-09T17:59:50Z)
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [68.70755196744533]
RoboGen is a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics.
arXiv Detail & Related papers (2023-11-02T17:59:21Z)
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations [55.549956643032836]
MimicGen is a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations. We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks.
arXiv Detail & Related papers (2023-10-26T17:17:31Z)
Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task. DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos. DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.