Fugu-MT Paper Translation (Abstract): ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Paper overview: ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

arxiv url: http://arxiv.org/abs/2603.04338v1
Date: Wed, 04 Mar 2026 17:58:04 GMT
Status: Translation complete
Last updated in system: 2026-03-05 21:29:15.439533
Title: ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
Authors: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
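Abstract summary: We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.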
Reference score (independently computed attention score): 51.06020148149403
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
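As a rough illustration of the flow-based cue named in key design 1, and not the authors' implementation, the sketch below separates dynamic from static pixels by thresholding the magnitude of a dense optical flow field. The function name `dynamic_mask_from_flow`, the threshold value, and the synthetic flow array are all assumptions made for this example; the abstract does not specify how the flow is computed or segmented.

```python
import numpy as np

def dynamic_mask_from_flow(flow, threshold=1.0):
    """Return a boolean (H, W) mask of pixels whose flow magnitude exceeds threshold.

    flow: (H, W, 2) array of per-pixel displacements in pixels (hypothetical input;
    in practice it would come from some optical flow estimator).
    threshold: illustrative cutoff in pixels, not a value from the paper.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # per-pixel displacement length
    return magnitude > threshold

# Synthetic example: a static background with one moving 2x2 patch.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1:3, 1:3] = [3.0, 0.0]  # patch moves 3 pixels along x
mask = dynamic_mask_from_flow(flow)
```

Here `mask` marks the moving patch as dynamic and everything else as static. A real pipeline would likely need more than a fixed threshold (for example, compensation for camera motion), which this sketch leaves out.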

Related papers