Fugu-MT Paper Translation (Abstract): Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Paper overview: Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

arxiv url: http://arxiv.org/abs/2604.21741v1
Date: Thu, 23 Apr 2026 14:42:54 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.603571
Title: Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
Authors: Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yanjiang Guo, Jiaming Liu, Shanghang Zhang, Jianyu Chen, Yichen Zhu
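Abstract summary: This paper proposes a post-training framework that uses a learned world model. Hi-WM caches intermediate states and supports rollback and branching. Hi-WM is evaluated on three real-world manipulation tasks spanning rigid and deformable object interaction, and on two policy backbones.
Reference score (independently computed attention score): 54.896907620476675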
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
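The caching, rollback, and branching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the integer state, the additive `toy_world_model`, the `RolloutCache` class, and the hand-written corrective action sequences are all hypothetical stand-ins, chosen only to show how one cached failure state might be reused for several corrective continuations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = int   # hypothetical stand-in for a world-model state
Action = int  # hypothetical stand-in for a robot action


def toy_world_model(state: State, action: Action) -> State:
    """Assumed toy dynamics: the next state is state + action."""
    return state + action


@dataclass
class RolloutCache:
    """Stores intermediate states so a rollout can be rewound and branched."""
    states: List[State] = field(default_factory=list)

    def record(self, state: State) -> int:
        self.states.append(state)
        return len(self.states) - 1  # index doubles as a rollback handle

    def rollback(self, index: int) -> State:
        return self.states[index]


def rollout(policy: Callable[[State], Action], start: State,
            steps: int, cache: RolloutCache) -> State:
    """Run the policy in closed loop inside the model, caching every state."""
    state = start
    cache.record(state)
    for _ in range(steps):
        state = toy_world_model(state, policy(state))
        cache.record(state)
    return state


def branch(cache: RolloutCache, index: int,
           corrections: List[List[Action]]) -> List[List[Tuple[State, Action]]]:
    """Replay several corrective action sequences from one cached state."""
    trajectories = []
    for actions in corrections:
        state = cache.rollback(index)  # every branch restarts from the same state
        trajectory = []
        for action in actions:
            trajectory.append((state, action))
            state = toy_world_model(state, action)
        trajectories.append(trajectory)
    return trajectories


cache = RolloutCache()
final_state = rollout(lambda s: 1, start=0, steps=3, cache=cache)

# Suppose a human flags the state at index 2 as failure-prone and supplies
# two short corrective continuations (made-up values for illustration).
corrective = branch(cache, index=2, corrections=[[-1, -1], [2]])

training_set: List[List[Tuple[State, Action]]] = []
training_set.extend(corrective)  # corrective data added back for post-training
```

In this sketch each branch yields a list of (state, action) pairs that begin at the same cached state, which mirrors the abstract's point that a single failure state can supply multiple corrective continuations.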
