Ctrl-World: A Controllable Generative World Model for Robot Manipulation
- URL: http://arxiv.org/abs/2510.10125v2
- Date: Wed, 15 Oct 2025 00:46:49 GMT
- Title: Ctrl-World: A Controllable Generative World Model for Robot Manipulation
- Authors: Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn
- Abstract summary: Generalist robot policies can perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space.
- Score: 53.71061464925014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. Such a model must be compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, a combination not achieved by previous works. In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach improves policy success rates by 44.7%.
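To make the evaluation-in-imagination loop concrete, here is a minimal sketch of rolling a policy out inside an action-conditioned world model. `WorldModel`, `policy`, and the shapes used are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
import numpy as np

class WorldModel:
    """Stand-in for an action-conditioned, multi-view video predictor."""

    def reset(self, initial_frames: np.ndarray) -> np.ndarray:
        # A pose-conditioned memory of past frames would be kept here.
        self.memory = [initial_frames]
        return initial_frames

    def step(self, action: np.ndarray) -> np.ndarray:
        # Real model: predict the next multi-view frames conditioned on the
        # frame-level action and retrieved memory. Here: a noisy placeholder.
        nxt = self.memory[-1] + 0.01 * np.random.randn(*self.memory[-1].shape)
        self.memory.append(nxt)
        return nxt

def rollout_in_imagination(world_model, policy, instruction, init_frames, horizon=40):
    """Roll a policy out entirely inside the world model (no real robot)."""
    obs = world_model.reset(init_frames)
    frames = [obs]
    for _ in range(horizon):
        action = policy(obs, instruction)  # the policy only ever sees imagined frames
        obs = world_model.step(action)
        frames.append(obs)
    return frames

# Hypothetical usage: random policy over a 7-DoF action space, 2 camera views.
policy = lambda obs, instr: np.random.uniform(-1, 1, size=7)
frames = rollout_in_imagination(WorldModel(), policy, "pick up the mug",
                                np.zeros((2, 64, 64, 3)))
print(len(frames))  # horizon + 1 imagined multi-view observations
```

Imagined rollouts like this can then be scored (e.g., by a success classifier) to rank policies, and the successful ones kept as supervised fine-tuning data, which is, roughly, the pipeline behind the 44.7% improvement reported in the abstract.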
Related papers
- DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos [110.98100817695307]
We introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning.
arXiv Detail & Related papers (2026-02-06T18:49:43Z)
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning [106.57043104902584]
We introduce Cosmos Policy, a simple approach for adapting a large pretrained video model into an effective robot policy. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks.
arXiv Detail & Related papers (2026-01-22T18:09:30Z)
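As a rough illustration of the "actions as latent frames" idea above, the sketch below interleaves a low-dimensional action chunk into a video model's latent sequence as pseudo-frames. The packing scheme and all shapes are assumptions for illustration, not Cosmos Policy's actual design.

```python
import numpy as np

def pack_actions_as_latent_frames(video_latents: np.ndarray,
                                  actions: np.ndarray) -> np.ndarray:
    """Interleave actions into the latent sequence as pseudo-frames.

    video_latents: (T, C, H, W) latents from a video tokenizer/VAE.
    actions:       (T, A) robot actions, with A <= C.
    Returns:       (2*T, C, H, W), odd slots holding action pseudo-frames,
                   so the diffusion process denoises frames and actions jointly.
    """
    T, C, H, W = video_latents.shape
    A = actions.shape[1]
    assert A <= C, "action dim must fit in the channel dim for this toy packing"
    action_frames = np.zeros_like(video_latents)
    action_frames[:, :A, 0, 0] = actions  # one action vector per pseudo-frame
    packed = np.empty((2 * T, C, H, W), dtype=video_latents.dtype)
    packed[0::2], packed[1::2] = video_latents, action_frames
    return packed

packed = pack_actions_as_latent_frames(np.random.randn(8, 16, 4, 4),
                                       np.random.randn(8, 7))
print(packed.shape)  # (16, 16, 4, 4)
```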
- WorldGym: World Model as An Environment for Policy Evaluation [41.204900701616914]
WorldGym is an autoregressive, action-conditioned video generation model which serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints.
arXiv Detail & Related papers (2025-05-31T15:51:56Z)
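A minimal version of the evaluation protocol just described, with hypothetical `rollout_fn` and `reward_fn` callables standing in for the world-model rollout and the VLM judge:

```python
import numpy as np
from scipy.stats import spearmanr

def mc_policy_score(rollout_fn, reward_fn, tasks, n_rollouts=10):
    """Monte Carlo estimate of policy success inside a world model.

    rollout_fn(task) -> imagined frames for one episode (policy x world model)
    reward_fn(frames, instruction) -> scalar reward, e.g. from a VLM judge
    """
    scores = [reward_fn(rollout_fn(t), t["instruction"])
              for t in tasks for _ in range(n_rollouts)]
    return float(np.mean(scores))

# What "preserving relative policy rankings" means in practice:
imagined = [0.82, 0.55, 0.31]          # mc_policy_score for three checkpoints
real = [0.78, 0.60, 0.25]              # hypothetical real-robot success rates
rho, _ = spearmanr(imagined, real)
print(f"rank correlation: {rho:.2f}")  # 1.00 => identical ordering
```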
- WorldEval: World Model as Real-World Robot Policies Evaluator [13.899692171641066]
A key challenge is generating accurate policy videos from world models that faithfully reflect the robot's actions. We propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent actions to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online.
arXiv Detail & Related papers (2025-05-25T07:41:39Z)
- Action Flow Matching for Continual Robot Learning [54.10050120844738]
Continual learning in robotics seeks systems that can constantly adapt to changing environments and tasks. We introduce a generative framework leveraging flow matching for online robot dynamics model alignment. We find that by transforming the actions themselves rather than exploring with a misaligned model, the robot collects informative data more efficiently.
arXiv Detail & Related papers (2025-04-25T16:26:15Z)
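To illustrate "transforming the actions themselves", the sketch below integrates a learned velocity field over a proposed action, flow-matching style. The toy velocity field and the 7-DoF action are placeholders, not the paper's model.

```python
import numpy as np

def transform_action(velocity_field, action: np.ndarray, n_steps: int = 10) -> np.ndarray:
    """Map a policy's proposed action to a dynamics-aligned one.

    Rather than exploring with a misaligned dynamics model, integrate a
    learned velocity field v(a, t) from t=0 to t=1 (Euler steps) so the
    transformed action produces the intended outcome under the true dynamics.
    """
    a, dt = action.astype(float).copy(), 1.0 / n_steps
    for i in range(n_steps):
        a = a + dt * velocity_field(a, i * dt)  # one Euler step along the flow
    return a

# Toy stand-in for the learned field: pulls actions toward a fixed offset.
v = lambda a, t: 0.5 - a
print(transform_action(v, np.zeros(7)))         # transformed 7-DoF action
```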
- RoboGrasp: A Universal Grasping Policy for Robust Robotic Control [8.189496387470726]
RoboGrasp is a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning. It significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few-shot learning and grasping box prompt tasks.
arXiv Detail & Related papers (2025-02-05T11:04:41Z)
- Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z)
- GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance [15.774237279917594]
We propose an agentic framework for robot self-guidance and self-improvement. Our framework iteratively grounds a base robot policy to relevant objects in the environment. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z)
- IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
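A "frame-level action-conditioning module within each transformer block" suggests per-frame feature modulation. Below is one plausible adaLN-style reading in PyTorch; all dimensions and the modulation scheme are assumptions, not IRASim's exact design.

```python
import torch
import torch.nn as nn

class FrameLevelActionConditioning(nn.Module):
    """Sketch of frame-level action conditioning (adaLN-style).

    Each frame's tokens are modulated by a scale/shift computed from that
    frame's own action, strengthening per-frame action-frame alignment.
    """
    def __init__(self, d_model: int, action_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(action_dim, 2 * d_model)

    def forward(self, tokens: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, T, N, D) - N spatial tokens per frame
        # actions: (B, T, A)    - one action per frame
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)  # (B, T, D)
        return self.norm(tokens) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

x = torch.randn(2, 8, 64, 256)  # batch of 2, 8 frames, 64 tokens, dim 256
a = torch.randn(2, 8, 7)        # 7-DoF action per frame
print(FrameLevelActionConditioning(256, 7)(x, a).shape)  # (2, 8, 64, 256)
```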
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
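For the last entry, one common way a segmentation mask enables object-pose perception is back-projecting masked depth pixels into 3D. This sketch assumes a pinhole camera model and is only an illustrative reading of the summary, not the paper's method.

```python
import numpy as np

def object_position_from_mask(mask: np.ndarray, depth: np.ndarray,
                              fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """3D object centroid from a segmentation mask and a depth image.

    mask:  (H, W) boolean, e.g. from a language-prompted foundation segmenter.
    depth: (H, W) metric depth in meters.
    fx, fy, cx, cy: pinhole intrinsics of the camera.
    """
    v, u = np.nonzero(mask)                # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                  # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)  # centroid, camera frame

mask = np.zeros((48, 64), dtype=bool)
mask[20:30, 30:40] = True                  # toy mask of a detected object
print(object_position_from_mask(mask, np.full((48, 64), 0.8),
                                60.0, 60.0, 32.0, 24.0))
```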