Towards Semantic 3D Hand-Object Interaction Generation via Functional Text Guidance
- URL: http://arxiv.org/abs/2502.20805v1
- Date: Fri, 28 Feb 2025 07:42:54 GMT
- Title: Towards Semantic 3D Hand-Object Interaction Generation via Functional Text Guidance
- Authors: Yongqi Tian, Xueyu Sun, Haoyuan He, Linji Hao, Ning Ding, Caigui Jiang
- Abstract summary: Hand-object interaction (HOI) is the fundamental link between humans and their environment. Despite advances in AI and robotics, capturing the semantics of functional grasping tasks remains a considerable challenge. We propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text.
- Score: 9.630837159704004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand-object interaction (HOI) is the fundamental link between humans and their environment, yet the dexterity and complexity of hand poses make gesture control a significant challenge. Despite substantial advances in AI and robotics that enable machines to understand and simulate hand-object interactions, capturing the semantics of functional grasping tasks remains a considerable challenge. While previous work can generate stable and geometrically correct 3D grasps, the results still fall short of functional grasps because grasp semantics are left unconsidered. To address this challenge, we propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text. The framework consists of a text-guided 3D model generator, the Functional Grasp Generator (FGG), and a pose optimization strategy, the Functional Grasp Refiner (FGR). FGG generates 3D models of hands and objects from text input, while FGR fine-tunes the poses using an Object Pose Approximator and energy functions so that the relative position between hand and object aligns with human intent and remains physically plausible. Extensive experiments demonstrate that our approach achieves precise and high-quality HOI generation without requiring additional 3D annotation data.
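To make the refinement stage concrete, below is a minimal sketch of energy-based pose refinement in the spirit of FGR, assuming a translation-only object offset and a simple contact-plus-penetration proxy energy; the terms, margin, and weights are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def refine_object_pose(hand_verts, obj_verts, steps=200, lr=1e-2):
    # Hedged sketch, NOT the paper's FGR: optimize a translation-only
    # object offset under a contact-plus-penetration proxy energy.
    offset = torch.zeros(3, requires_grad=True)  # assumed: translation only
    opt = torch.optim.Adam([offset], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        d = torch.cdist(hand_verts, obj_verts + offset)  # (N_hand, N_obj)
        nearest = d.min(dim=1).values                    # per hand vertex
        e_contact = nearest.mean()                       # pull hand toward surface
        # crude penetration proxy: penalize hand vertices closer than 5 mm
        # (a true penetration term would need signed distances)
        e_pen = torch.relu(0.005 - nearest).sum()
        energy = e_contact + 10.0 * e_pen                # illustrative weights
        energy.backward()
        opt.step()
    return offset.detach()
```

In practice such an energy would act on full 6-DoF object poses and articulated hand parameters, with signed distances for the penetration term.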
Related papers
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
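As a rough sketch of what such a query could look like, the snippet below selects Gaussian primitives whose distilled features match a query embedding by cosine similarity; the array layout, names, and threshold are illustrative assumptions rather than IAAO's interface.

```python
import numpy as np

def query_gaussians(centers, feats, query, thresh=0.8):
    # Hedged sketch: pick 3D Gaussian primitives whose distilled features
    # match a query embedding (names and threshold are illustrative).
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sim = f @ q                  # cosine similarity per primitive
    mask = sim > thresh
    return centers[mask], mask   # selected centers and the selection mask
```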
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction [86.54738165527502]
We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object.
Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
- Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets.
We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
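A minimal sketch of the subsequent composition step, assuming the estimated object pose is a rotation matrix R and translation t (names and conventions are illustrative, not the paper's code):

```python
import numpy as np

def compose_hoi(human_joints, obj_points, R, t):
    # Hypothetical helper: apply an estimated category-level 6-DoF pose
    # (R, t) so human and object live in one shared scene frame.
    obj_in_scene = obj_points @ R.T + t  # rigid transform of object points
    return human_joints, obj_in_scene
```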
arXiv Detail & Related papers (2025-03-25T23:55:47Z)
- FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction [1.8124328823188356]
We focus on detecting and storing objects at a finer resolution, concentrating on their affordance-relevant parts.
We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline.
arXiv Detail & Related papers (2025-03-10T23:13:35Z)
- EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning [66.68366281305977]
This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction.
Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions.
We propose a novel body pose generation strategy for the text-to-HOI task: first infer an object-agnostic canonical body action, then enrich it with object-specific interaction styles.
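Schematically, this ordering could be expressed as below, where both models stand in for learned modules; the interfaces are assumptions, not EigenActor's API.

```python
def generate_hoi_motion(text_emb, obj_feat, action_model, style_model):
    # Hedged sketch of the described ordering, not EigenActor's API:
    canonical = action_model(text_emb)       # object-agnostic canonical action
    return style_model(canonical, obj_feat)  # enrich with object-specific style
```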
arXiv Detail & Related papers (2025-03-01T07:15:10Z)
- HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation [29.766317710266765]
We propose a new 3D Gaussian Splatting based data augmentation framework for bimanual hand-object interaction. We use mesh-based 3DGS to model objects and hands, and address the rendering blur caused by the multi-resolution input images. We extend the single-hand grasping pose optimization module to the bimanual setting, generating diverse poses of bimanual hand-object interaction.
arXiv Detail & Related papers (2025-01-06T08:48:17Z)
- Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Grasping in Dexterous Robotics [27.124273762587848]
Affordance features of objects serve as a bridge in the functional interaction between agents and objects.
We propose a granularity-aware affordance feature extraction method for locating functional affordance areas.
We also use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures.
Together, these components form GAAF-Dex, a complete framework for dexterous robotic functional grasping.
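As a toy illustration of mapping a highly activated affordance region to a grasp gesture, the sketch below thresholds a coarse affordance heatmap and matches the region's centroid to the nearest gesture prototype; all names and the nearest-prototype rule are assumptions, not GAAF-Dex itself.

```python
import numpy as np

def predict_gesture(heatmap, points, prototypes, thresh=0.7):
    # Toy sketch: summarize the highly activated affordance region and
    # match it to the nearest gesture prototype (illustrative only).
    active = heatmap > thresh                # highly activated affordance area
    if not active.any():
        return None
    desc = points[active].mean(axis=0)       # crude region descriptor (centroid)
    dists = np.linalg.norm(prototypes - desc, axis=1)
    return int(dists.argmin())               # index of the predicted gesture
```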
arXiv Detail & Related papers (2024-06-30T07:42:57Z)
- Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication [50.541882834405946]
We introduce Atlas3D, an automatic and easy-to-implement text-to-3D method.
Our approach combines a novel differentiable simulation-based loss function with physically inspired regularization.
We verify Atlas3D's efficacy through extensive generation tasks and validate the resulting 3D models in both simulated and real-world environments.
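One example of a physically inspired regularizer in this spirit is a self-support proxy that keeps the center of mass over the ground-contact footprint; the sketch below is an illustrative assumption, not Atlas3D's actual loss.

```python
import torch

def stability_regularizer(verts, ground_eps=1e-3):
    # Hedged sketch of a self-support proxy (not Atlas3D's loss):
    # penalize the horizontal offset between the center of mass and
    # the centroid of the ground-contact footprint.
    com = verts.mean(dim=0)                           # crude center of mass
    ground = verts[:, 2] < verts[:, 2].min() + ground_eps
    footprint = verts[ground][:, :2].mean(dim=0)      # support-region centroid
    return torch.norm(com[:2] - footprint)            # want COM over footprint
```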
arXiv Detail & Related papers (2024-05-28T18:33:18Z)
- Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction [8.253265795150401]
This paper introduces the first text-guided method for generating sequences of 3D hand-object interaction.
For contact generation, a VAE-based network takes as input a text and an object mesh, and generates the probability of contacts between the surfaces of hands and the object.
For motion generation, a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion.
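The two-stage structure described above can be sketched as a pipeline in which the predicted contact map conditions the diffusion model; the module interfaces are illustrative assumptions, not Text2HOI's code.

```python
def text2hoi_style_pipeline(text, obj_mesh, contact_vae, motion_diffusion):
    # Hedged sketch of the two-stage structure: a VAE predicts contact
    # probabilities over the surfaces, which then serve as a prior for a
    # diffusion model that generates the hand-object motion sequence.
    contact_map = contact_vae(text, obj_mesh)               # stage 1: contacts
    motion = motion_diffusion(text, obj_mesh, contact_map)  # stage 2: motion
    return contact_map, motion
```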
arXiv Detail & Related papers (2024-03-31T04:56:30Z)
- InterFusion: Text-Driven Generation of 3D Human-Object Interaction [38.380079482331745]
We tackle the complex task of generating 3D human-object interactions (HOI) from textual descriptions in a zero-shot text-to-3D manner.
We present InterFusion, a two-stage framework specifically designed for HOI generation.
Our experimental results affirm that InterFusion significantly outperforms existing state-of-the-art methods in 3D HOI generation.
arXiv Detail & Related papers (2024-03-22T20:49:26Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
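A minimal sketch of bundling the two conditioning signals (language for style and intent, planner waypoints for scene grounding); the dataclass and field names are illustrative assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionCondition:
    # Hypothetical conditioning bundle in the spirit of CHOIS: text sets
    # style and intent, planner waypoints ground the motion in the scene.
    text: str
    waypoints: np.ndarray  # (T, 3) sparse object positions from path planning

cond = InteractionCondition(
    text="pick up the lamp and place it on the table",
    waypoints=np.array([[0.0, 0.0, 0.8], [0.5, 0.2, 0.8], [1.0, 0.4, 0.7]]),
)
```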
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
- GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Hence, modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
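GRIP's input/output contract, as described, can be sketched as a function signature; the names and tensor shapes below are assumptions, not GRIP's actual API.

```python
def synthesize_hand_motion(body_motion, obj_motion, model):
    # Hedged sketch of the described contract, not GRIP's actual API:
    # body motion (T, J, 3) and object motion (T, 6) go in, and a learned
    # model returns two-hand motion spanning the approach, manipulation,
    # and release phases of the interaction.
    left_hand, right_hand = model(body_motion, obj_motion)
    return left_hand, right_hand
```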
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.