Fugu-MT Paper Translation (Abstract): Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Paper summary: Ctrl-World: A Controllable Generative World Model for Robot Manipulation

arxiv url: http://arxiv.org/abs/2510.10125v1
Date: Sat, 11 Oct 2025 09:13:10 GMT
Status: Translation complete
System update date: 2025-10-14 20:23:38.92451
Title: Ctrl-World: A Controllable Generative World Model for Robot Manipulation
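Title (reference translation): Ctrl-World: a controllable generative world model for robot manipulation
Authors: Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn
Abstract summary: Generalist robot policies can perform a wide range of manipulation skills. Evaluating and improving their ability on unfamiliar objects and instructions is a significant challenge. World models offer a promising, scalable alternative by letting policies roll out within imagination space.
Reference score (independently computed attention score): 53.71061464925014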
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. This requires a world model compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, which is not achieved by previous works. In this paper, we make a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7%.
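Abstract (reference translation): Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their abilities on unfamiliar objects and instructions remains a major challenge. Rigorous evaluation requires many real-world rollouts, while systematic improvement requires additional corrective data with expert labels. Both processes are slow, costly, and hard to scale. World models offer a promising, scalable alternative by letting policies roll out in imagination space. The key challenge, however, is building a controllable world model that can handle multi-step interaction with generalist robot policies. This calls for a world model that is compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interaction, which earlier work did not achieve. This paper takes a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. The model maintains long-horizon consistency through a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), it generates spatially and temporally consistent trajectories for over 20 seconds under novel scenarios and new camera placements. The authors show that the method can accurately rank policy performance without real-world robot rollouts. In addition, synthesizing successful trajectories in imagination and using them for supervised fine-tuning improves policy success by 44.7%.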
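
The abstract names a pose-conditioned memory retrieval mechanism but does not say how retrieval is done. As a rough illustration only, the sketch below assumes retrieval means picking the earlier frames whose recorded pose is nearest (in Euclidean distance) to the current pose. The function name `retrieve_memory_frames`, the 3-D pose vectors, and the choice of `k` are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def retrieve_memory_frames(
    past_poses: Sequence[Sequence[float]],
    query_pose: Sequence[float],
    k: int = 2,
) -> List[int]:
    """Return indices of the k past frames whose pose is closest to query_pose.

    Assumption: "closest" means smallest Euclidean distance between pose
    vectors; the abstract does not specify the actual retrieval rule.
    """
    distances = [
        (math.dist(pose, query_pose), index)
        for index, pose in enumerate(past_poses)
    ]
    distances.sort()  # equal distances are ordered by earlier frame index
    return [index for _, index in distances[:k]]


# Hypothetical 3-D positions recorded for four earlier frames.
history = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
nearest = retrieve_memory_frames(history, query_pose=(0.08, 0.0, 0.0), k=2)
print(nearest)
```

In a setup like this, the retrieved frames would be supplied to the world model as extra context so that a revisited viewpoint stays consistent with what was generated earlier; how the actual model consumes them is not stated in the abstract.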

Related papers