HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance
- URL: http://arxiv.org/abs/2506.07209v1
- Date: Sun, 08 Jun 2025 16:15:39 GMT
- Title: HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance
- Authors: Lei Li, Angela Dai
- Abstract summary: We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions from text prompts. Part Affordance Graphs (PAGs) encode fine-grained part information along with contact relations. Our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences.
- Score: 33.77779848399525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion, driven by part-level affordance reasoning. In contrast to prior works that focus on global, whole body-object motion for 4D HOI synthesis, we observe that generating realistic and diverse HOIs requires a finer-grained understanding -- at the level of how human body parts engage with object parts. We thus introduce Part Affordance Graphs (PAGs), a structured HOI representation distilled from large language models (LLMs) that encodes fine-grained part information along with contact relations. We then use these PAGs to guide a three-stage synthesis: first, decomposing input 3D objects into geometric parts; then, generating reference HOI videos from text prompts, from which we extract part-based motion constraints; finally, optimizing for 4D HOI motion sequences that not only mimic the reference dynamics but also satisfy part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.
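The abstract describes PAGs only at a conceptual level. As a concrete illustration, here is a minimal sketch of how such a graph of body-part/object-part nodes with contact-relation edges might be represented; all class, field, and relation names are hypothetical assumptions, not the authors' schema.

```python
# Hypothetical sketch of a Part Affordance Graph (PAG); the paper does not
# publish a schema, so every name and field here is an illustrative assumption.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartNode:
    """A node is either a human body part or a geometric object part."""
    name: str          # e.g. "hips" or "chair_seat"
    kind: str          # "body" or "object"


@dataclass(frozen=True)
class ContactEdge:
    """A contact relation between a body part and an object part."""
    body_part: PartNode
    object_part: PartNode
    relation: str      # e.g. "sits_on", "grasps" -- distilled from an LLM


@dataclass
class PartAffordanceGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_contact(self, body: str, obj: str, relation: str) -> None:
        b = PartNode(body, "body")
        o = PartNode(obj, "object")
        self.nodes += [n for n in (b, o) if n not in self.nodes]
        self.edges.append(ContactEdge(b, o, relation))


# Example: "a person sits on a chair" might distill to part-level contacts.
pag = PartAffordanceGraph()
pag.add_contact("hips", "chair_seat", "sits_on")
pag.add_contact("back", "chair_backrest", "leans_on")
for e in pag.edges:
    print(f"{e.body_part.name} --{e.relation}--> {e.object_part.name}")
```

Per the abstract, such graphs are distilled from an LLM and then drive the three-stage synthesis: part decomposition, reference-video generation with part-based motion constraints, and constrained 4D motion optimization.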
Related papers
- GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects [13.830968058014546]
GenHOI is a two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. We introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate 3D HOIs into dense, temporally coherent 4D HOI sequences. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets. (A toy sketch of keyframe densification follows this entry.)
arXiv Detail & Related papers (2025-06-18T14:17:53Z)
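GenHOI's ContactDM is a learned diffusion model; purely to make the densification step it performs concrete, here is a hedged, non-learned stand-in that interpolates sparse (translation, rotation) keyframes into a dense track. The function names and the use of quaternion slerp are illustrative assumptions, not GenHOI's method.

```python
# Naive keyframe densification: NOT GenHOI's ContactDM, just a baseline showing
# what "interpolating sparse 3D HOI keyframes into a dense, temporally coherent
# sequence" means mechanically.
import numpy as np


def slerp(q0: np.ndarray, q1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two unit quaternions."""
    dot = float(np.dot(q0, q1))
    if dot < 0.0:               # take the short arc on the quaternion sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:            # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def densify(keyframes: list, steps: int) -> list:
    """Interpolate (translation, quaternion) keyframes into a dense track."""
    dense = []
    for (p0, q0), (p1, q1) in zip(keyframes, keyframes[1:]):
        for i in range(steps):
            t = i / steps
            dense.append(((1 - t) * p0 + t * p1, slerp(q0, q1, t)))
    dense.append(keyframes[-1])
    return dense


key = [(np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])),
       (np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0, 0.0]))]
print(len(densify(key, steps=10)))  # 11 dense frames from 2 keyframes
```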
- InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing [36.29681929804816]
We propose a novel zero-shot 3D HOI generation framework without training on specific datasets. We use a pre-trained 2D image diffusion model to parse unseen objects and extract contact points. We then introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts.
arXiv Detail & Related papers (2025-05-30T07:53:55Z)
- Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z)
- BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects [70.20706475051347]
BimArt is a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. We first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. (A toy illustration of a distance-based contact map follows this entry.)
arXiv Detail & Related papers (2024-12-06T14:23:56Z)
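BimArt's contact maps are learned and articulation-aware; as a simplified, assumption-laden illustration of the underlying geometric quantity a "distance-based contact map" captures, one can map each hand vertex's distance to the nearest object point into a soft contact value:

```python
# Simplified sketch of a distance-based contact map: per hand-vertex distance
# to the nearest object point, mapped to (0, 1]. BimArt's maps are learned and
# articulation-aware; this shows only the underlying geometric quantity, and
# the falloff value is an arbitrary assumption.
import numpy as np


def contact_map(hand_verts: np.ndarray,      # (H, 3) hand mesh vertices
                object_pts: np.ndarray,      # (O, 3) sampled object surface
                falloff: float = 0.02) -> np.ndarray:
    """Soft contact per hand vertex: 1 at the surface, decaying with distance."""
    # Pairwise differences (H, O, 3), then nearest object point per hand vertex.
    diff = hand_verts[:, None, :] - object_pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1).min(axis=1)
    return np.exp(-dist / falloff)


hand = np.random.default_rng(0).normal(size=(778, 3))  # e.g. MANO has 778 verts
obj = np.random.default_rng(1).normal(size=(2048, 3))
cm = contact_map(hand, obj)
print(cm.shape, float(cm.max()))  # (778,) and the closest vertex's contact value
```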
- AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling penetration issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z)
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z)
- CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement [24.287902864042792]
We present CORE4D, a novel large-scale 4D human-object collaboration dataset. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration strategy to augment motions to a variety of novel objects. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis.
arXiv Detail & Related papers (2024-06-27T17:32:18Z)
- TC4D: Trajectory-Conditioned Text-to-4D Generation [94.90700997568158]
We propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components.
We learn local deformations that conform to the global trajectory using supervision from a text-to-video model.
Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion.
arXiv Detail & Related papers (2024-03-26T17:55:11Z)
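As a hedged sketch of the global/local motion factorization TC4D describes, the snippet below composes a rigid pose along a trajectory with a local deformation. The circular trajectory and the toy "bounce" deformation are illustrative assumptions; in TC4D the local deformation is learned with supervision from a text-to-video model.

```python
# Sketch of TC4D-style factorization: a point x at time t is moved by a global
# rigid transform along a trajectory, composed with a local deformation. Both
# the trajectory and the deformation below are stand-in assumptions.
import numpy as np


def global_transform(t: float):
    """Rigid pose along a hypothetical circular trajectory: R(t), p(t)."""
    c, s = np.cos(t), np.sin(t)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # yaw along path
    p = np.array([np.cos(t), np.sin(t), 0.0])                   # unit circle
    return R, p


def local_deformation(x: np.ndarray, t: float) -> np.ndarray:
    """Toy stand-in for the learned deformation: a small vertical bounce."""
    return x + np.array([0.0, 0.0, 0.05 * np.sin(8.0 * t)])


def animate(x: np.ndarray, t: float) -> np.ndarray:
    """Compose: deform in the body frame, then apply the global trajectory pose."""
    R, p = global_transform(t)
    return R @ local_deformation(x, t) + p


x = np.array([0.1, 0.0, 0.5])                 # a point on the canonical shape
print(animate(x, 0.0), animate(x, np.pi / 2))  # same point at two times
```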
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the very first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z)