WorldGym: World Model as An Environment for Policy Evaluation
- URL: http://arxiv.org/abs/2506.00613v3
- Date: Tue, 30 Sep 2025 03:34:34 GMT
- Title: WorldGym: World Model as An Environment for Policy Evaluation
- Authors: Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, Sherry Yang
- Abstract summary: WorldGym is an autoregressive, action-conditioned video generation model which serves as a proxy to real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints.
- Score: 41.204900701616914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy to real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Due to requiring only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
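The evaluation loop described in the abstract (Monte Carlo rollouts inside a world model, scored by a reward model) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not WorldGym's code: the "frame" is a single float rather than an image, `toy_world_model` stands in for the action-conditioned video model, `toy_reward` stands in for the vision-language model judge, and the two policies are hypothetical proportional controllers.

```python
import random
from typing import Callable, List

Frame = float   # toy stand-in for an image observation
Action = float  # toy stand-in for a robot action

TARGET = 1.0    # hypothetical goal position for the toy task


def toy_world_model(frame: Frame, action: Action, rng: random.Random) -> Frame:
    """Stand-in for the world model: predicts the next frame from frame + action."""
    return frame + action + rng.gauss(0.0, 0.02)


def toy_reward(frames: List[Frame]) -> float:
    """Stand-in for the VLM judge: 1.0 if the final frame is near the target."""
    return 1.0 if abs(frames[-1] - TARGET) < 0.1 else 0.0


def strong_policy(frame: Frame) -> Action:
    """Hypothetical policy that moves quickly toward the target."""
    return 0.5 * (TARGET - frame)


def weak_policy(frame: Frame) -> Action:
    """Hypothetical policy that moves too slowly to finish within the horizon."""
    return 0.05 * (TARGET - frame)


def rollout(policy: Callable[[Frame], Action], start_frame: Frame,
            horizon: int, rng: random.Random) -> float:
    """Autoregressively roll the policy out in the world model, then score it."""
    frames = [start_frame]
    for _ in range(horizon):
        action = policy(frames[-1])
        frames.append(toy_world_model(frames[-1], action, rng))
    return toy_reward(frames)


def estimate_success_rate(policy: Callable[[Frame], Action],
                          start_frames: List[Frame],
                          rollouts_per_frame: int = 50,
                          horizon: int = 20,
                          seed: int = 0) -> float:
    """Monte Carlo estimate: mean reward over rollouts from each start frame."""
    rng = random.Random(seed)
    rewards = [rollout(policy, frame, horizon, rng)
               for frame in start_frames
               for _ in range(rollouts_per_frame)]
    return sum(rewards) / len(rewards)


start_frames = [0.0, 0.2]  # each rollout needs only a single start frame
strong_rate = estimate_success_rate(strong_policy, start_frames)
weak_rate = estimate_success_rate(weak_policy, start_frames)
print(f"strong policy: {strong_rate:.2f}, weak policy: {weak_rate:.2f}")
```

In this sketch, comparing `strong_rate` with `weak_rate` mirrors the ranking use case: the absolute numbers matter less than whether the ordering of policies matches the ordering seen in real-world trials.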
Related papers
- PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies [88.78188489161028]
We introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS). PolaRiS is a scalable real-to-sim framework for high-fidelity simulated robot evaluation. We show that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks.
arXiv Detail & Related papers (2025-12-18T18:49:41Z) - Evaluating Gemini Robotics Policies in a Veo World Simulator [69.23071832313246]
We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
arXiv Detail & Related papers (2025-12-11T14:22:14Z) - RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation [47.79800816696372]
Real-world testing of manipulation policies is labor-intensive at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated augmented environments.
arXiv Detail & Related papers (2025-10-27T17:41:38Z) - Ctrl-World: A Controllable Generative World Model for Robot Manipulation [53.71061464925014]
Generalist robot policies can perform a wide range of manipulation skills. Evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space.
arXiv Detail & Related papers (2025-10-11T09:13:10Z) - World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation [23.270985761700203]
We propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies for robotic manipulation. World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning.
arXiv Detail & Related papers (2025-09-23T14:38:15Z) - WorldEval: World Model as Real-World Robot Policies Evaluator [13.899692171641066]
A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent action to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online.
arXiv Detail & Related papers (2025-05-25T07:41:39Z) - Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin for Real-World Robot Policy Evaluation [8.36634439225698]
We propose real-is-sim, a behavior cloning framework that incorporates a dynamic digital twin throughout the policy development pipeline. We validate real-is-sim on the PushT manipulation task, demonstrating strong correlation between success rates obtained in the simulator and real-world evaluations.
arXiv Detail & Related papers (2025-04-04T17:05:56Z) - Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z) - WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z) - GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance [15.774237279917594]
We propose an agentic framework for robot self-guidance and self-improvement. Our framework iteratively grounds a base robot policy to relevant objects in the environment. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z) - IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z) - Evaluating Real-World Robot Manipulation Policies in Simulation [91.55267186958892]
Control and visual disparities between real and simulated environments are key challenges for reliable simulated evaluation.
We propose approaches for mitigating these gaps without needing to craft full-fidelity digital twins of real-world environments.
We create SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups.
arXiv Detail & Related papers (2024-05-09T17:30:16Z) - Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR).
arXiv Detail & Related papers (2023-07-28T05:47:24Z) - Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.