THOR: Text to Human-Object Interaction Diffusion via Relation Intervention
- URL: http://arxiv.org/abs/2403.11208v1
- Date: Sun, 17 Mar 2024 13:17:25 GMT
- Title: THOR: Text to Human-Object Interaction Diffusion via Relation Intervention
- Authors: Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, Jingya Wang
- Abstract summary: We propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR).
In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion.
We construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset.
- Score: 51.02435289160616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses new methodologies to deal with the challenging task of generating dynamic Human-Object Interactions from textual descriptions (Text2HOI). While most existing works assume interactions with limited body parts or static objects, our task involves addressing the variation in human motion, the diversity of object shapes, and the semantic vagueness of object motion simultaneously. To tackle this, we propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR). THOR is a cohesive diffusion model equipped with a relation intervention mechanism. In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. This intervention enhances the spatial-temporal relations between humans and objects, with human-centric interaction representation providing additional guidance for synthesizing consistent motion from text. To achieve more reasonable and realistic results, interaction losses are introduced at different levels of motion granularity. Moreover, we construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset. Both quantitative and qualitative experiments demonstrate the effectiveness of our proposed model.
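The per-step pattern described in the abstract (denoise human and object motion, then let the human-object relation correct the object) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `denoise_step` is a hypothetical stand-in for a learned, text-conditioned denoiser, `intervene` is a simple linear blend standing in for the relation intervention, and `rel_offset` is a hypothetical fixed object offset in a human-centric frame.

```python
import random

def denoise_step(x, target, t, total_steps):
    # Hypothetical stand-in for a learned text-conditioned denoiser:
    # moves the current estimate toward `target`, reaching it on the last step.
    alpha = 1.0 / (total_steps - t)
    return [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]

def intervene(obj, human, rel_offset, weight):
    # Assumed form of the intervention: blend the object estimate with the
    # position implied by the human estimate plus a human-centric offset.
    implied = [h + r for h, r in zip(human, rel_offset)]
    return [(1 - weight) * o + weight * p for o, p in zip(obj, implied)]

def sample(human_target, object_target, rel_offset, steps=10, weight=0.5, seed=0):
    rng = random.Random(seed)
    # Start both motions from Gaussian noise.
    human = [rng.gauss(0, 1) for _ in human_target]
    obj = [rng.gauss(0, 1) for _ in object_target]
    for t in range(steps):
        # 1) Denoise human and object motion independently.
        human = denoise_step(human, human_target, t, steps)
        obj = denoise_step(obj, object_target, t, steps)
        # 2) Use the human-object relation to correct the object estimate.
        obj = intervene(obj, human, rel_offset, weight)
    return human, obj

human, obj = sample([1.0, 2.0, 3.0], [1.1, 2.0, 3.0], [0.1, 0.0, 0.0])
```

In this toy setup the object target is consistent with the human target plus the offset, so the intervention only pulls intermediate object estimates toward the human-implied position and the final result matches both targets.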
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- TextIM: Part-aware Interactive Motion Synthesis from Text [25.91739105467082]
TextIM is a novel framework for synthesizing TEXT-driven human Interactive Motions.
Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts.
For training and evaluation, we carefully selected and re-labeled interactive motions from HUMANML3D to develop a specialized dataset.
arXiv Detail & Related papers (2024-08-06T17:08:05Z)
- InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction [27.10256777126629]
This paper showcases the potential of generating human-object interactions without direct training on text-interaction pair data.
We introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion.
By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner.
arXiv Detail & Related papers (2024-03-28T17:59:30Z)
- Inter-X: Towards Versatile Human-Human Interaction Analysis [100.254438708001]
We propose Inter-X, a dataset with accurate body movements and diverse interaction patterns.
The dataset includes 11K interaction sequences and more than 8.1M frames.
We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions.
arXiv Detail & Related papers (2023-12-26T13:36:05Z)
- HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models [42.62823339416957]
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts.
We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text.
We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object.
arXiv Detail & Related papers (2023-12-11T17:41:17Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
- InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion [29.25063155767897]
This paper addresses a novel task of anticipating 3D human-object interactions (HOIs).
Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions.
Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task, capable of producing realistic, vivid, and remarkably long-term 3D HOI predictions.
arXiv Detail & Related papers (2023-08-31T17:59:08Z)
- NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis [21.650091018774972]
We create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input.
This interaction field guides the sampling of an object-conditioned human motion diffusion model.
We synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion.
arXiv Detail & Related papers (2023-07-14T17:59:38Z)
- Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the very first model that leverages human pose estimation.
arXiv Detail & Related papers (2022-12-20T19:50:54Z)
- Learn to Predict How Humans Manipulate Large-sized Objects from Interactive Motions [82.90906153293585]
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task.
We show that the proposed network, which consumes dynamic descriptors, achieves state-of-the-art prediction results and generalizes better to unseen objects.
arXiv Detail & Related papers (2022-06-25T09:55:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.