Fugu-MT Paper Translation (Abstract): GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

Paper overview: GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

arxiv url: http://arxiv.org/abs/2606.12995v1
Date: Thu, 11 Jun 2026 07:31:05 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.644437
Title: GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training
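Title (reference translation): GenHOI: Contact-aware humanoid-object interaction that imitates generated videos without task-specific training
Authors: Zhihai Bi, Qiang Zhang, Guoyang Zhao, Jiahang Cao, Xueyin Luo, Yushan Zhang, Jinglan Xu, Ruoyu Geng, Yulin Li, Andrew F. Luo, Jun Ma
Abstract summary: Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay. We present GenHOI, a framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks.
Reference score (independently computed attention score): 20.414478780328437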
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.
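Abstract (reference translation): Humanoid-object interaction (HOI) is a fundamental capability of humanoid robots, but it remains difficult because dynamic balance is tightly coupled with stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training, or rely on rigid trajectory replay that limits their ability to handle new interaction scenarios. This work proposes GenHOI, a simple and effective framework that lets a humanoid robot perform diverse object-interaction tasks zero-shot by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which together with the language command conditions the synthesis of a task-oriented interaction video. The generated video is analyzed to identify interaction-relevant contact events and estimate hand-object contact regions; these are encoded as object-centric geometric constraints that turn visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed, which resolves the scale ambiguity inherent in 2D video generation and adapts a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. The framework is validated in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.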
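The abstract describes refining and smoothing a video-derived reference motion under contact constraints. The sketch below is only a toy illustration of that general idea, not the authors' formulation: it treats a single hand position trajectory, replaces contact regions with one target point, and solves a weighted linear least-squares problem. The function name `refine_trajectory`, the weights, and the example numbers are all assumptions made for illustration.

```python
import numpy as np


def refine_trajectory(reference, contact_frames, contact_point,
                      w_track=1.0, w_smooth=10.0, w_contact=100.0):
    """Toy contact-aware refinement (hypothetical helper, not from the paper).

    reference:      (T, 3) hand positions recovered from a video
    contact_frames: frame indices where the hand should touch the object
    contact_point:  (3,) assumed contact location on the object
    Returns a (T, 3) trajectory minimizing a weighted sum of squared
    tracking, smoothness, and contact residuals.
    """
    T = reference.shape[0]
    identity = np.eye(T)

    # Tracking term: stay close to the reference motion.
    a_track = np.sqrt(w_track) * identity
    b_track = np.sqrt(w_track) * reference

    # Smoothness term: penalize second differences (discrete acceleration).
    second_diff = np.zeros((T - 2, T))
    for t in range(T - 2):
        second_diff[t, t:t + 3] = [1.0, -2.0, 1.0]
    a_smooth = np.sqrt(w_smooth) * second_diff
    b_smooth = np.zeros((T - 2, 3))

    # Contact term: pull the hand to the contact point at contact frames.
    a_contact = np.sqrt(w_contact) * identity[contact_frames]
    b_contact = np.sqrt(w_contact) * np.tile(contact_point,
                                             (len(contact_frames), 1))

    a = np.vstack([a_track, a_smooth, a_contact])
    b = np.vstack([b_track, b_smooth, b_contact])
    refined, *_ = np.linalg.lstsq(a, b, rcond=None)
    return refined


# Example: a straight-line reference that misses the contact point in y.
reference = np.zeros((20, 3))
reference[:, 0] = np.linspace(0.0, 1.0, 20)
contact_frames = [9, 10]
contact_point = np.array([0.5, 0.2, 0.0])
refined = refine_trajectory(reference, contact_frames, contact_point)
```

Raising `w_contact` relative to `w_track` moves the hand closer to the assumed contact point at the contact frames, while `w_smooth` spreads the correction over neighboring frames instead of producing a jump.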

Related papers