WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
- URL: http://arxiv.org/abs/2602.13977v1
- Date: Sun, 15 Feb 2026 03:48:20 GMT
- Title: WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
- Authors: Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, Dongbin Zhao,
- Abstract summary: We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. It improves rollout stability through a controllable action-conditioned video world model, and reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts.
- Score: 30.884160045861616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
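The abstract's central idea, Keyframe-Initialized Rollouts, can be sketched as follows: instead of one long imagined rollout in which world-model error compounds over the full horizon, the trajectory is split into short segments that each restart from a ground-truth keyframe, so error can only accumulate for `segment_len` steps. This is a minimal illustrative sketch of that reasoning, not the authors' implementation; `policy`, `world_model_step`, and `keyframes` are hypothetical placeholders.

```python
# Hedged sketch of keyframe-initialized rollouts: each imagined segment is
# re-anchored at a real keyframe, bounding the effective error depth to
# `segment_len` instead of the full horizon. All names are illustrative.

def keyframe_initialized_rollout(keyframes, policy, world_model_step, segment_len):
    """Roll out short imagined segments, each restarted from a real keyframe."""
    trajectory = []
    for keyframe in keyframes:
        state = keyframe                              # reset to a ground-truth frame
        for _ in range(segment_len):                  # bounded error accumulation
            action = policy(state)
            state = world_model_step(state, action)   # imagined transition
            trajectory.append((state, action))
    return trajectory
```

With K keyframes, the policy still experiences K × `segment_len` imagined transitions, but no rollout depends on more than `segment_len` consecutive model predictions.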
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - World-Gymnast: Training Robots with Reinforcement Learning in a World Model [4.491505634160759]
We propose World-Gymnast, which performs RL finetuning of a vision-language-action policy by rolling out the policy in an action-conditioned video world model. On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms a software simulator by as much as 2x. Our results suggest that learning a world model and training robot policies in the cloud could be key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.
arXiv Detail & Related papers (2026-02-02T18:44:45Z) - Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation [48.26705293834693]
Failure-Aware Offline-to-Online Reinforcement Learning (FARL) is a new paradigm minimizing failures during real-world reinforcement learning. We propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration.
arXiv Detail & Related papers (2026-01-12T18:53:11Z) - WMPO: World Model-based Policy Optimization for Vision-Language-Action Models [22.01666177489494]
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA reinforcement learning without interacting with the real environment.
arXiv Detail & Related papers (2025-11-12T17:54:09Z) - Ctrl-World: A Controllable Generative World Model for Robot Manipulation [53.71061464925014]
Generalist robot policies can perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within an imagination space.
arXiv Detail & Related papers (2025-10-11T09:13:10Z) - VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators [38.880852900641]
Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL.
arXiv Detail & Related papers (2025-10-01T01:33:10Z) - World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation [23.270985761700203]
We propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies for robotic manipulation. World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning.
arXiv Detail & Related papers (2025-09-23T14:38:15Z) - WorldGym: World Model as An Environment for Policy Evaluation [41.204900701616914]
WorldGym is an autoregressive, action-conditioned video generation model which serves as a proxy to real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints.
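The evaluation scheme this blurb describes, Monte Carlo rollouts in a learned world model scored by a separate reward model, can be sketched in a few lines. This is an assumption-laden sketch of the general recipe, not WorldGym's actual interface; `world_model`, `reward_model`, and `policy` are hypothetical callables.

```python
# Hedged sketch of policy evaluation inside a learned world model: roll the
# policy out in imagination several times and average a scalar episode score
# from a judge model (e.g. a VLM). All names are illustrative placeholders.

def evaluate_policy(policy, world_model, reward_model, init_obs, horizon, n_rollouts):
    """Average episodic reward of `policy` over imagined rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        obs = init_obs
        frames = [obs]
        for _ in range(horizon):
            obs = world_model(obs, policy(obs))  # action-conditioned generation
            frames.append(obs)
        total += reward_model(frames)            # judge scores the whole episode
    return total / n_rollouts
```

Because only relative scores matter for ranking policies, a noisy judge can still preserve orderings across policy versions, which is the property the paper reports.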
arXiv Detail & Related papers (2025-05-31T15:51:56Z) - Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach [55.76249793590689]
Video-Enhanced Offline RL (VeoRL) is a model-based method that constructs an interactive world model from diverse, unlabeled video data readily available online. VeoRL achieves substantial performance gains across visual control tasks in robotic manipulation, autonomous driving, and open-world video games.
arXiv Detail & Related papers (2025-05-10T00:54:12Z) - Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator [50.191655141020505]
Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap. We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates uncertainty to improve policy learning without reliance on a physics simulator.
arXiv Detail & Related papers (2025-04-23T12:58:15Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS applies to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z)