ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
- URL: http://arxiv.org/abs/2603.04338v1
- Date: Wed, 04 Mar 2026 17:58:04 GMT
- Title: ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
- Authors: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
- Abstract summary: We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.
- Score: 51.06020148149403
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
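To make the abstract's two key designs concrete, below is a minimal, hedged sketch (not the authors' released code) of how flow-based dynamic/static separation and the decoupled object-then-human ordering could be wired together. The flow field is assumed to come from any off-the-shelf estimator (e.g., RAFT or Farneback), and `reconstruct_articulation` / `synthesize_human_motion` are hypothetical placeholders standing in for the paper's per-stage optimizers.

```python
# Illustrative sketch of the abstract's two key designs; not ArtHOI's code.
import numpy as np

def dynamic_part_mask(flow: np.ndarray, percentile: float = 80.0) -> np.ndarray:
    """Flow-based part segmentation (design 1): split dynamic vs. static pixels.

    flow: (H, W, 2) optical flow between consecutive frames, from any
    off-the-shelf estimator. Returns a boolean (H, W) mask that is True
    where motion magnitude exceeds a scene-adaptive threshold.
    """
    mag = np.linalg.norm(flow, axis=-1)      # per-pixel motion magnitude
    thresh = np.percentile(mag, percentile)  # robust, scene-adaptive cutoff
    return mag > thresh

def reconstruct_articulation(frames, masks):
    # Hypothetical placeholder for stage 1: optimize object articulation
    # (e.g., a hinge angle per frame) using only the dynamic regions.
    return [{"joint_angle_rad": 0.0} for _ in frames]

def synthesize_human_motion(frames, object_states):
    # Hypothetical placeholder for stage 2: generate human motion
    # conditioned on the already-recovered object states.
    return [{"pose": None} for _ in frames]

def decoupled_hoi_pipeline(frames, flows):
    """Decoupled reconstruction (design 2): object first, human second.

    Joint human+object optimization is unstable under monocular
    ambiguity, so the two stages run sequentially.
    """
    masks = [dynamic_part_mask(f) for f in flows]
    object_states = reconstruct_articulation(frames, masks)
    human_motion = synthesize_human_motion(frames, object_states)
    return object_states, human_motion

# Toy usage with random data, just to show the call pattern.
frames = [np.zeros((64, 64, 3)) for _ in range(4)]
flows = [np.random.randn(64, 64, 2) for _ in range(4)]
obj_states, human = decoupled_hoi_pipeline(frames, flows)
```

Thresholding flow magnitude is only a crude stand-in for the paper's segmentation; the point of the sketch is the ordering, in which segmentation feeds articulation recovery, which in turn conditions human-motion synthesis.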
Related papers
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, the framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z) - CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives [65.89192712575797]
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Our approach reduces motion-tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale.
arXiv Detail & Related papers (2025-12-16T18:59:50Z) - CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction [40.557276644446475]
We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos.
arXiv Detail & Related papers (2025-12-12T19:11:11Z) - VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video [60.63575135514847]
Building digital twins of articulated objects from monocular video presents an essential challenge in computer vision. We introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. VideoArtGS demonstrates state-of-the-art performance in articulation and mesh reconstruction, reducing the reconstruction error by about two orders of magnitude compared to existing methods.
arXiv Detail & Related papers (2025-09-22T11:52:02Z) - HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics [60.737929335600015]
We present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization.
arXiv Detail & Related papers (2025-08-13T14:50:19Z) - Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video [56.781766315691854]
We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance.
arXiv Detail & Related papers (2025-08-08T21:31:51Z) - HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance [33.77779848399525]
We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions from text prompts. Part Affordance Graphs (PAGs) encode fine-grained part information along with contact relations. Our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences.
arXiv Detail & Related papers (2025-06-08T16:15:39Z)