Fugu-MT Paper Translation (Summary): $Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Paper summary: $Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

arxiv url: http://arxiv.org/abs/2603.12263v1
Date: Thu, 12 Mar 2026 17:59:51 GMT
Status: Translation complete
Last updated in system: 2026-03-21 18:33:56.733272
Title: $Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
Title (reference translation): An Open Foundation Model Toward Universal Humanoid Loco-Manipulation
Authors: Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang
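Abstract summary: This paper proposes an open foundation model for humanoid loco-manipulation tasks. The work identifies a critical yet often overlooked data recipe: pre-training on high-quality human manipulation data, followed by post-training on domain-specific real-world humanoid trajectories, yields superior performance.
Reference score (attention score computed independently by the site): 39.811210435945924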
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: First, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations. Then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $Ψ_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10$\times$ as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
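Abstract (reference translation): We introduce $Ψ_0$ (Psi-Zero), an open foundation model for challenging humanoid loco-manipulation tasks. Existing approaches often try to address this fundamental problem by co-training on large and diverse human and humanoid data, but we argue that this strategy is suboptimal because of the fundamental kinematic and motion disparities between humans and humanoid robots. As a result, data efficiency and model performance remain unsatisfactory despite the large data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: first, a VLM backbone is autoregressively pre-trained on large-scale egocentric human videos to acquire generalizable visual-action representations. Next, a flow-based action expert is post-trained on high-quality humanoid robot data to learn precise robot joint control. In contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments show that $Ψ_0$ achieves the best performance with only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10 times as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including the data processing and training pipeline, the humanoid foundation model, and the real-time action inference engine.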

Related papers