TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction
- URL: http://arxiv.org/abs/2506.00953v1
- Date: Sun, 01 Jun 2025 10:56:16 GMT
- Title: TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction
- Authors: Yiyao Huang, Zhedong Zheng, Yu Ziwei, Yaxiong Wang, Tze Ho Elden Tse, Angela Yao
- Abstract summary: We propose a new Text-Instructed Generation and Refinement (TIGeR) framework to steer object shape refinement and pose estimation. We use a two-stage framework: text-instructed prior generation followed by vision-guided refinement. TIGeR achieves competitive performance, i.e., object Chamfer distances of 1.979 and 5.468 on the widely-used Dex-YCB and Obman datasets.
- Score: 43.61297194416115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual effort to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer object shape refinement and pose estimation. We use a two-stage framework: text-instructed prior generation followed by vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacted with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., object Chamfer distances of 1.979 and 5.468 on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios.
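The abstract names 2D-3D collaborative attention as the mechanism that calibrates the synthesized prototype against image evidence, but does not spell it out. As a rough illustration only, the sketch below shows a generic cross-attention module in which 3D prototype point features act as queries over 2D image features and predict per-vertex corrections; the module name, feature dimensions, and offset head are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossModalRefiner(nn.Module):
    """Hypothetical stand-in for 2D-3D collaborative attention:
    3D prototype point features (queries) attend to 2D image features
    (keys/values); the fused features predict per-vertex XYZ offsets."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.offset_head = nn.Linear(dim, 3)  # per-vertex correction

    def forward(self, point_feats, image_feats, points):
        # point_feats: (B, N, dim), image_feats: (B, HW, dim), points: (B, N, 3)
        fused, _ = self.attn(point_feats, image_feats, image_feats)
        return points + self.offset_head(fused)  # refined prototype vertices
```

The reported object Chamfer distances refer to the standard symmetric Chamfer metric between predicted and ground-truth point clouds. A minimal NumPy version is sketched below; the squaring convention, point-sampling density, and unit scaling used in the paper's evaluation are not stated here, so numbers from this sketch are not directly comparable to the reported 1.979 and 5.468.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric (squared) Chamfer distance between point clouds (N, 3) and (M, 3)."""
    dist_sq = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)  # (N, M)
    return dist_sq.min(axis=1).mean() + dist_sq.min(axis=0).mean()

# Example with two random 2048-point clouds
pred = np.random.rand(2048, 3).astype(np.float32)
gt = np.random.rand(2048, 3).astype(np.float32)
print(chamfer_distance(pred, gt))
```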
Related papers
- InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing [36.29681929804816]
We propose a novel zero-shot 3D HOI generation framework without training on specific datasets. We use a pre-trained 2D image diffusion model to parse unseen objects and extract contact points. We then introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts.
arXiv Detail & Related papers (2025-05-30T07:53:55Z) - MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation [28.75149480374178]
MEgoHand is a framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. It achieves substantial reductions in wrist translation error and joint rotation error, highlighting its capacity to accurately model fine-grained hand joint structures.
arXiv Detail & Related papers (2025-05-22T12:37:47Z) - SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories [124.24041272390954]
Modeling hand-object interaction priors holds significant potential to advance robotic and embodied AI systems. We introduce SIGHT, a novel task focused on generating realistic and physically plausible 3D hand-object interaction trajectories from a single image. We propose SIGHT-Fusion, a novel diffusion-based image-text conditioned generative model that tackles this task by retrieving the most similar 3D object mesh from a database.
arXiv Detail & Related papers (2025-03-28T20:53:20Z) - Towards Semantic 3D Hand-Object Interaction Generation via Functional Text Guidance [9.630837159704004]
Hand-object interaction (HOI) is the fundamental link between humans and their environment. Despite advances in AI and robotics, capturing the semantics of functional grasping tasks remains a considerable challenge. We propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text.
arXiv Detail & Related papers (2025-02-28T07:42:54Z) - EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild [79.71523320368388]
Our work aims to reconstruct hand-object interactions from a single-view image. We first design a novel pipeline to estimate the underlying hand pose and object shape. With the initial reconstruction, we employ a prior-guided optimization scheme.
arXiv Detail & Related papers (2024-11-21T16:33:35Z) - Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction [8.253265795150401]
This paper introduces the first text-guided approach to generating 3D hand-object interaction sequences.
For contact generation, a VAE-based network takes as input a text and an object mesh, and generates the probability of contacts between the surfaces of hands and the object.
For motion generation, a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion.
arXiv Detail & Related papers (2024-03-31T04:56:30Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - T2TD: Text-3D Generation Model based on Prior Knowledge Guidance [74.32278935880018]
We propose a novel text-3D generation model (T2TD), which introduces related shapes or textual information as prior knowledge to improve the performance of the 3D generation model.
Our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the text2shape datasets.
arXiv Detail & Related papers (2023-05-25T06:05:52Z) - Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from single monocular RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.