Fugu-MT Paper Translation (Summary): Scaling World Model for Hierarchical Manipulation Policies

Paper summary: Scaling World Model for Hierarchical Manipulation Policies

arxiv url: http://arxiv.org/abs/2602.10983v2
Date: Thu, 12 Feb 2026 10:16:27 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.371211
Title: Scaling World Model for Hierarchical Manipulation Policies
Authors: Qian Long, Yueze Wang, Jiaxi Song, Junbo Zhang, Peiyan Li, Wenxuan Wang, Yuqi Wang, Haoyang Li, Shaoxuan Xie, Guocai Yao, Hanbo Zhang, Xinlong Wang, Zhongyuan Wang, Xuguang Lan, Huaping Liu, Xinghang Li
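Abstract summary: Vision-Language-Action (VLA) models promise general-purpose robot manipulation but are brittle in out-of-distribution settings. This paper proposes a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model. Both visual goal synthesis and the hierarchical VLA policies are validated in large-scale out-of-distribution scenarios.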
Reference score (independently computed attention score): 61.736772957803026
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve this generalization bottleneck, we introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. VISTA consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in massive out-of-distribution scenarios; with the guidance generated by the world model, the performance of the same-structured VLA in novel scenarios can improve from 14% to 69%. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/
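The two-level control flow described in the abstract (a world model that splits a task into subtasks with goal images, and a low-level policy that turns each subtask into actions) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Subgoal` type, the `plan`, `act`, and `step` callables, and the toy stand-ins are all hypothetical names introduced here, and the real models are replaced by trivial stubs.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Action = Sequence[float]  # hypothetical action representation


@dataclass
class Subgoal:
    """One subtask from the high-level planner (hypothetical type)."""
    instruction: str  # textual subtask description
    goal_image: Any   # synthesized goal image (array, tensor, ...)


def run_hierarchical(
    task: str,
    observation: Any,
    plan: Callable[[str, Any], List[Subgoal]],
    act: Callable[[Any, Subgoal], List[Action]],
    step: Callable[[Action], Any],
) -> Tuple[List[Action], Any]:
    """Plan once with the high-level model, then execute each subgoal in order."""
    executed: List[Action] = []
    for subgoal in plan(task, observation):
        # The low-level policy sees the observation at the start of the
        # subtask plus the textual and visual guidance for that subtask.
        for action in act(observation, subgoal):
            observation = step(action)  # environment returns the next observation
            executed.append(action)
    return executed, observation


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_plan(task: str, obs: Any) -> List[Subgoal]:
    return [Subgoal(f"{task}: step {i}", goal_image=None) for i in range(2)]


def toy_act(obs: Any, subgoal: Subgoal) -> List[Action]:
    return [(0.0, 0.0, 0.0)] * 3


def toy_step(action: Action) -> Any:
    return "next-observation"


actions, final_obs = run_hierarchical(
    "put the cup on the plate", "initial-observation", toy_plan, toy_act, toy_step
)
```

With these stubs the planner yields two subgoals and the policy emits three actions per subgoal, so `actions` holds six entries. Whether the actual system plans once up front or re-plans between subtasks is not stated in the abstract; planning once is an assumption of this sketch.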

