Fugu-MT Paper Translation (Abstract): GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

Paper overview: GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

arxiv url: http://arxiv.org/abs/2603.06048v1
Date: Fri, 06 Mar 2026 09:01:09 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.395889
Title: GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Title (reference translation): GenHOI: Towards Object-Consistent Hand-Object Interaction via Temporally Balanced and Spatially Selective Object Injection
Authors: Xuan Huang, Mochu Xiang, Zhelun Shen, Jinbo Wu, Chenming Wu, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Shanshan Liu, Haocheng Feng, Wei He, Jingdong Wang
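Abstract summary: GenHOI is a lightweight augmentation to pretrained video generation models. It injects reference-object information in a temporally balanced and spatially selective manner. GenHOI outperforms state-of-the-art HOI reenactment and all-in-one video editing methods.
Reference score (independently computed attention measure): 54.879037588415656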
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
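Abstract (reference translation): Hand-Object Interaction (HOI) remains a central challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Recent HOI reenactment approaches have made progress, but they are typically trained and evaluated in-domain and fail to generalize to complex in-the-wild scenarios. In contrast, all-in-one video editing models are more broadly robust but still struggle with HOI-specific issues such as inconsistent object appearance. This paper proposes GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, it proposes Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, it designs a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Evaluations show that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/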
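
Illustrative note: the abstract says Head-Sliding RoPE gives each attention head its own temporal offset for the reference tokens so that their influence is spread evenly across frames. The toy sketch below is one possible reading of that idea, not the authors' implementation. The even spacing rule, the function names (`head_offsets`, `nearest_ref_distance`), and the sizes (8 heads, 64 frames) are all assumptions made for illustration. It only compares how far, in temporal positions, each frame is from the closest reference position when every head shares one position versus when the heads are spread out.

```python
# Toy illustration only: the spacing rule and all names here are assumptions,
# not details taken from the GenHOI paper.

def head_offsets(num_heads: int, num_frames: int) -> list:
    """Give each head one temporal position, spread evenly over the frame range."""
    if num_heads == 1:
        return [(num_frames - 1) / 2]
    step = (num_frames - 1) / (num_heads - 1)
    return [h * step for h in range(num_heads)]


def nearest_ref_distance(frame: int, offsets: list) -> float:
    """Temporal distance from a frame to the closest per-head reference position."""
    return min(abs(frame - p) for p in offsets)


num_heads, num_frames = 8, 64

# Hypothetical "sliding" layout: a different temporal position per head.
sliding = head_offsets(num_heads, num_frames)

# Baseline layout: every head places the reference tokens at frame 0.
fixed = [0.0] * num_heads

# Worst case over all frames of the distance to the nearest reference position.
worst_sliding = max(nearest_ref_distance(t, sliding) for t in range(num_frames))
worst_fixed = max(nearest_ref_distance(t, fixed) for t in range(num_frames))

print("per-head positions:", sliding)
print("worst-case distance, shared position:", worst_fixed)
print("worst-case distance, per-head positions:", worst_sliding)
```

Under these assumed settings, the last frame sits 63 positions away from a shared reference position, while with per-head positions no frame is more than 4 positions from some head's reference. Because rotary embeddings make attention depend on relative position, a smaller relative distance in at least one head is the intuition behind reducing long-range decay; how the paper actually chooses the offsets is not stated in the abstract.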


Related papers