Related papers: EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning

URL: http://arxiv.org/abs/2503.00382v2
Date: Tue, 04 Mar 2025 02:17:57 GMT
Title: EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning
Authors: Xuehao Gao, Yang Yang, Shaoyi Du, Yang Wu, Yebin Liu, Guo-Jun Qi
Abstract summary: This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions. We propose a novel body pose generation strategy for the text-to-HOI task: infer object-agnostic canonical body action first and then enrich object-specific interaction styles.
Score: 66.68366281305977
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions, which may encounter a performance bottleneck because of the huge cross-modality gap. In this paper, we observe that HOI samples with the same interaction intention toward different targets, e.g., "lift a chair" and "lift a cup", always encapsulate similar action-specific body motion patterns while characterizing different object-specific interaction styles. Thus, learning effective action-specific motion priors and object-specific interaction priors is crucial for a text-to-HOI model and dominates its performance on text-HOI semantic consistency and body-object interaction realism. In light of this, we propose a novel body pose generation strategy for the text-to-HOI task: infer object-agnostic canonical body action first and then enrich object-specific interaction styles. Specifically, the first stage, canonical body action inference, focuses on learning intra-class shareable body motion priors and mapping given text-based semantics to action-specific canonical 3D body motions. Then, in the object-specific interaction inference stage, we focus on object affordance learning and enrich object-specific interaction styles on an inferred action-specific body motion basis. Extensive experiments verify that our proposed text-to-HOI synthesis system significantly outperforms other SOTA methods on three large-scale datasets, with better semantic consistency and interaction realism.

Related papers

IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects [70.20706475051347]
BimArt is a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. We first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation.
arXiv Detail & Related papers (2024-12-06T14:23:56Z)
TextIM: Part-aware Interactive Motion Synthesis from Text [25.91739105467082]
TextIM is a novel framework for synthesizing TEXT-driven human Interactive Motions. Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts. For training and evaluation, we carefully selected and re-labeled interactive motions from HUMANML3D to develop a specialized dataset.
arXiv Detail & Related papers (2024-08-06T17:08:05Z)
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction [27.10256777126629]
This paper showcases the potential of generating human-object interactions without direct training on text-interaction pair data. We introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner.
arXiv Detail & Related papers (2024-03-28T17:59:30Z)
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions [15.417836855005087]
We propose a novel method, dubbed DiffH2O, which can synthesize realistic, one or two-handed object interactions. The method introduces three techniques that enable effective learning from limited data.
arXiv Detail & Related papers (2024-03-26T16:06:42Z)
THOR: Text to Human-Object Interaction Diffusion via Relation Intervention [51.02435289160616]
We propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. We construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset.
arXiv Detail & Related papers (2024-03-17T13:17:25Z)
Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes. Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene. Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object. We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object. We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality. GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.