WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2511.09515v1
- Date: Thu, 13 Nov 2025 01:59:17 GMT
- Title: WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
- Authors: Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo,
- Abstract summary: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment.
- Score: 22.01666177489494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
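The group-relative policy optimization (GRPO) mentioned in the abstract scores each sampled trajectory against the other trajectories in its group rather than against a learned value function. A minimal sketch of the advantage computation, assuming binary task-success rewards per imagined rollout (the function name and normalization details are illustrative, not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampled group. Illustrative sketch
    only; the paper's exact normalization may differ."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 4 imagined rollouts, two succeed and two fail.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Successful rollouts receive positive advantages and failed ones negative, so the policy gradient pushes probability mass toward the successful behaviors without training a separate critic.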
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL [30.884160045861616]
We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. It improves rollout stability through a controllable action-conditioned video world model. It also reshapes imagined interaction to reduce effective error depth via keyframe-evolved rollouts.
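Both WMPO and WoVR train policies on rollouts generated entirely inside an action-conditioned world model. The core loop can be sketched as follows, where `world_model` and `policy` are placeholder callables standing in for the learned models (all names here are assumptions for illustration):

```python
import numpy as np

def imagined_rollout(world_model, policy, obs, horizon):
    """Roll a policy out inside an action-conditioned world model:
    world_model(obs, action) -> (next_obs, reward), policy(obs) -> action.
    No real-environment interaction occurs."""
    traj = []
    for _ in range(horizon):
        action = policy(obs)
        obs, reward = world_model(obs, action)
        traj.append((action, reward))
    return traj

# Toy stand-ins: a linear "world model" and a constant policy.
wm = lambda o, a: (o + a, float(np.sum(o)))
pi = lambda o: np.ones_like(o) * 0.1
traj = imagined_rollout(wm, pi, np.zeros(3), horizon=5)
```

In practice the observations are images (WMPO) or video frames (WoVR), and the rollout length is bounded to limit accumulated prediction error.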
arXiv Detail & Related papers (2026-02-15T03:48:20Z) - From Word to World: Can Large Language Models be Implicit Text-based World Models? [82.47317196099907]
Agentic reinforcement learning increasingly relies on experience-driven scaling. World models offer a potential way to improve learning efficiency through simulated experience. We study whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents.
arXiv Detail & Related papers (2025-12-21T17:28:42Z) - Learning Generalizable Visuomotor Policy through Dynamics-Alignment [13.655111993491674]
Recent approaches leveraging video prediction models have shown promising results by learning rich representations from large-scale datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization.
arXiv Detail & Related papers (2025-10-31T02:29:33Z) - Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition [52.232968183793986]
General Policy Composition (GPC) is a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies. GPC consistently improves performance and adaptability across a diverse set of tasks.
arXiv Detail & Related papers (2025-10-01T16:05:53Z) - Unified Vision-Language-Action Model [86.68814779303429]
We present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
arXiv Detail & Related papers (2025-06-24T17:59:57Z) - ROSA: Harnessing Robot States for Vision-Language and Action Alignment [24.426285156386715]
Vision-Language Models (VLMs) have made significant advances in end-to-end robotic control. We propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces.
arXiv Detail & Related papers (2025-06-16T16:34:20Z) - LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [45.02469804709771]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
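A common way to inject an action signal into a pre-trained visual backbone, as the lightweight conditioning module above must, is a FiLM-style scale-and-shift over feature channels. The sketch below is an assumption for illustration only; DWS's actual module design is not specified in this summary:

```python
import numpy as np

def film_condition(features, action, W_gamma, W_beta):
    """FiLM-style action conditioning: the action vector is projected to
    per-channel scale (gamma) and shift (beta) applied to visual features.
    Zero action leaves the features unchanged (identity at initialization)."""
    gamma = action @ W_gamma   # shape (channels,)
    beta = action @ W_beta     # shape (channels,)
    return features * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))   # 4 spatial positions, 8 channels
act = rng.normal(size=(3,))       # 3-dim action vector
out = film_condition(feats, act, rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
```

The `1.0 + gamma` form keeps the module near identity when the projection weights are small, which lets the pre-trained generator's behavior be preserved early in fine-tuning.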
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Strengthening Generative Robot Policies through Predictive World Modeling [25.45350191178106]
Generative predictive control (GPC) is a learning control framework that clones a generative diffusion-based policy from expert demonstrations. GPC consistently outperforms behavior cloning in both state-based and vision-based settings.
arXiv Detail & Related papers (2025-02-02T01:21:19Z) - Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z) - ReCoRe: Regularized Contrastive Representation Learning of World Model [21.29132219042405]
We present a world model that learns invariant features using contrastive unsupervised learning and an intervention-invariant regularizer.
Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark.
arXiv Detail & Related papers (2023-12-14T15:53:07Z) - Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning [5.09191791549438]
Recent works have achieved state-of-the-art results on several largely deterministic offline Atari and D4RL benchmarks.
We propose a method that addresses this optimism bias by explicitly disentangling the policy and world models.
We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.
arXiv Detail & Related papers (2022-07-21T04:12:48Z)