Fugu-MT Paper Translation (Abstract): GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Paper summary: GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

arxiv url: http://arxiv.org/abs/2606.05160v1
Date: Wed, 03 Jun 2026 17:57:45 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.950855
Title: GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
Title (reference translation): GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
Authors: Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan
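Abstract summary: GRAIL is a digital generation pipeline that composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. The recovered motions are retargeted to a humanoid robot and used to train complementary task-general trackers. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal.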
Reference score (independently computed attention score): 113.71148915419246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
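Abstract (reference translation): Scaling humanoid loco-manipulation requires robot-compatible demonstrations spanning diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are hard to scale because every collection depends on physical setups, instrumented actors, and robot operation. GRAIL is a digital generation pipeline that stays fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Instead of reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, so that model-based object tracking, human motion estimation, and interaction-aware optimization can reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. The recovered motions are retargeted to a humanoid robot and used to train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, egocentric visual policies are trained through a sim-to-real pipeline and deployed on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.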

Related papers