Related papers: DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

URL: http://arxiv.org/abs/2403.17827v1
Date: Tue, 26 Mar 2024 16:06:42 GMT
Title: DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Authors: Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin,
Abstract summary: We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions. We decompose the task into a grasping stage and a text-based interaction stage. In the grasping stage, the model only generates hand motions, whereas in the interaction phase both hand and object poses are synthesized.
Score: 15.417836855005087
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the interaction phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction phase. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.

Related papers

SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction [86.54738165527502]
We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning [66.68366281305977]
This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions. We propose a novel body pose generation strategy for the text-to-HOI task: infer object-agnostic canonical body action first and then enrich object-specific interaction styles.
arXiv Detail & Related papers (2025-03-01T07:15:10Z)
Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model [25.00532805042292]
We propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences. We introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.
arXiv Detail & Related papers (2024-12-30T02:21:43Z)
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects [70.20706475051347]
BimArt is a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. We first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation.
arXiv Detail & Related papers (2024-12-06T14:23:56Z)
ManiDext: Hand-Object Manipulation Synthesis via Continuous Correspondence Embeddings and Residual-Guided Diffusion [36.9457697304841]
ManiDext is a unified hierarchical diffusion-based framework for generating hand manipulation and grasp poses. Our key insight is that accurately modeling the contact correspondences between objects and hands during interactions is crucial. Our framework first generates contact maps and correspondence embeddings on the object's surface. Based on these fine-grained correspondences, we introduce a novel approach that integrates the iterative refinement process into the diffusion process.
arXiv Detail & Related papers (2024-09-14T04:28:44Z)
Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction [8.253265795150401]
This paper introduces the first text-guided work for generating the sequence of hand-object interaction in 3D. For contact generation, a VAE-based network takes as input a text and an object mesh, and generates the probability of contacts between the surfaces of hands and the object. For motion generation, a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion.
arXiv Detail & Related papers (2024-03-31T04:56:30Z)
Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method [63.49140028965778]
We present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions. To tackle these issues, we propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions.
arXiv Detail & Related papers (2024-03-24T14:24:13Z)
THOR: Text to Human-Object Interaction Diffusion via Relation Intervention [51.02435289160616]
We propose a novel Text-guided Human-Object Interaction diffusion model with Relation Intervention (THOR) In each diffusion step, we initiate text-guided human and object motion and then leverage human-object relations to intervene in object motion. We construct Text-BEHAVE, a Text2HOI dataset that seamlessly integrates textual descriptions with the currently largest publicly available 3D HOI dataset.
arXiv Detail & Related papers (2024-03-17T13:17:25Z)
Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling [18.128376292350836]
We propose a data-driven method for coarse hand motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Second, to capture the dynamic clues of hand-object interaction, we propose a new architecture.
arXiv Detail & Related papers (2024-01-29T09:17:51Z)
BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation. Our dataset includes accurate motion tracking for the human body and hands. We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z)
Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes. Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene. Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality. GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views [41.50710846018882]
We propose a neural rendering and pose estimation system for hand-object interaction from sparse views. We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction.
arXiv Detail & Related papers (2023-08-22T05:17:41Z)
Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation [84.28064034301445]
Self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error. We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
arXiv Detail & Related papers (2021-07-01T13:28:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.