Fugu-MT Paper Translation (Summary): EmbodiSwap for Zero-Shot Robot Imitation Learning

Paper summary: EmbodiSwap for Zero-Shot Robot Imitation Learning

arxiv url: http://arxiv.org/abs/2510.03706v1
Date: Sat, 04 Oct 2025 07:11:20 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.206726
Title: EmbodiSwap for Zero-Shot Robot Imitation Learning
Title (reference translation): EmbodiSwap for Zero-Shot Robot Imitation Learning
Authors: Eadom Dessalene, Pavan Mantripragada, Michael Maynord, Yiannis Aloimonos
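Abstract summary: EmbodiSwap is a method for overlaying synthetic robots on human video. We use EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild egocentric human video and a target robot embodiment. We use V-JEPA as a visual backbone, repurposing it from video understanding to imitation learning over synthetic robot videos.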
Reference score (independently computed attention score): 16.98296957464262
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce EmbodiSwap - a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild ego-centric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing V-JEPA from the domain of video understanding to imitation learning over synthetic robot videos. Adoption of V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an $82\%$ success rate, outperforming a few-shot trained $\pi_0$ network as well as $\pi_0$ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays which takes as input human videos and an arbitrary robot URDF and generates a robot dataset, (ii) the robot dataset we synthesize over EPIC-Kitchens, HOI4D and Ego4D, and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.
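Abstract (reference translation): EmbodiSwap is a method for overlaying photorealistic synthetic robots on human video. We use EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild egocentric human video and a target robot embodiment. We train a closed-loop robot manipulation policy on the data produced by EmbodiSwap. We use V-JEPA as a visual backbone, repurposing it from video understanding to imitation learning over synthetic robot videos. Adopting V-JEPA outperforms the vision backbones conventionally used in robotics. In real-world tests, the zero-shot trained V-JEPA model achieves an $82\%$ success rate, outperforming both a few-shot trained $\pi_0$ network and $\pi_0$ trained on data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays, which takes human videos and an arbitrary robot URDF as input and generates a robot dataset; (ii) the robot dataset synthesized over EPIC-Kitchens, HOI4D and Ego4D; and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.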

Related papers