Fugu-MT Paper Translation (Abstract): WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Paper overview: WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2511.09515v1
Date: Thu, 13 Nov 2025 01:59:17 GMT
Status: Translation complete
System update date: 2025-11-13 22:34:54.605813
Title: WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Title (reference translation): WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Authors: Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo
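Abstract summary: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. World-Model-based Policy Optimization (WMPO) is a principled framework for on-policy VLA RL that does not interact with the real environment.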
Reference score (independently computed attention metric): 22.01666177489494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
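Abstract (reference translation): Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their dependence on expert demonstrations limits their ability to learn from failures and to self-correct. Reinforcement learning (RL) addresses these limitations through self-improving interaction with the physical environment, but suffers from high sample complexity on real robots. World-Model-based Policy Optimization (WMPO) is a principled framework for on-policy VLA RL that requires no interaction with the real environment. In contrast to the widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Importantly, WMPO lets the policy perform on-policy GRPO, which delivers stronger performance than the commonly used off-policy methods. Extensive experiments in simulation and real-robot settings show that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.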
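The abstract names on-policy GRPO over "imagined" trajectories but does not spell out the update. As a rough illustration only, the group-relative advantage step commonly associated with GRPO can be sketched as below. The function name, the binary success rewards, and the group of four rollouts are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per rollout sampled from the same start state.
    `eps` avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcomes of four imagined rollouts from one start state
# (1.0 = success, 0.0 = failure). Illustrative values, not paper results.
imagined_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(imagined_rewards)
```

In this sketch, successful rollouts receive positive advantages and failed ones negative, while a group with identical rewards yields all-zero advantages and so contributes no learning signal.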

Related papers