Fugu-MT Paper Translation (Summary): Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

Paper summary: Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

arxiv url: http://arxiv.org/abs/2603.19709v2
Date: Tue, 24 Mar 2026 08:42:32 GMT
Status: Translation complete
Last updated in system: 2026-03-25 12:42:17.581773
Title: Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis
Authors: Weisheng Xu, Jian Li, Yi Gu, Bin Yang, Haodong Chen, Shuyi Lin, Mingqian Zhou, Jing Tan, Qiwei Wu, Xiangrui Jiang, Taowen Wang, Jiawen Wen, Qiwei Liang, Jiaxi Zhang, Renjing Xu
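Abstract summary: Dream2Act is a robot-centric framework that enables zero-shot interaction through generative video synthesis. Operating strictly within the robot-native coordinate space, Dream2Act avoids retargeting errors and eliminates task-specific policy training.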
Reference score (attention score computed by this site): 25.249184346335557
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these synthesized dreams, subsequently executed via a general-purpose whole-body controller. Operating strictly within the robot-native coordinate space, Dream2Act avoids retargeting errors and eliminates task-specific policy training. We evaluate Dream2Act on the Unitree G1 across four whole-body mobile interaction tasks: ball kicking, sofa sitting, bag punching, and box hugging. Dream2Act achieves a 37.5% overall success rate, compared to 0% for conventional retargeting. While retargeting fails to establish correct physical contacts due to the morphology gap (with errors compounded during locomotion), Dream2Act maintains robot-consistent spatial alignment, enabling reliable contact formation and substantially higher task completion.
