TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
- URL: http://arxiv.org/abs/2510.14874v1
- Date: Thu, 16 Oct 2025 16:52:58 GMT
- Title: TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
- Authors: Guangyi Han, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
- Abstract summary: Free-Form HOI Generation aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent. We construct WildO2, an in-the-wild 3D HOI dataset comprising diverse interactions derived from internet videos. Building on this dataset, we propose TOUCH, a three-stage framework that facilitates fine-grained semantic control to generate versatile hand poses.
- Score: 66.08264566003048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions such as pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild 3D HOI dataset comprising diverse interactions derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is at https://guangyid.github.io/hoi123touch.
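The refinement stage described in the abstract (contact consistency plus physical constraints) can be illustrated with a minimal sketch. The function below is a hypothetical, simplified contact-consistency term written for this listing; it is not the paper's actual implementation, and all names and shapes are assumptions:

```python
import numpy as np

def contact_consistency_loss(hand_verts, obj_points, contact_prob, thresh=0.5):
    """Penalize hand vertices labeled as contacts that lie far from the object.

    hand_verts:   (H, 3) hand mesh vertices
    obj_points:   (O, 3) sampled object surface points
    contact_prob: (H,)   per-vertex predicted contact probability
    """
    # Distance from each hand vertex to its nearest object surface point.
    d = np.linalg.norm(hand_verts[:, None, :] - obj_points[None, :, :], axis=-1)
    nearest = d.min(axis=1)               # (H,) nearest-surface distance
    mask = contact_prob > thresh          # vertices the model says are touching
    if not mask.any():
        return 0.0
    # Predicted contact vertices should sit on (or very near) the surface,
    # weighted by how confident the contact prediction is.
    return float((contact_prob[mask] * nearest[mask]).mean())
```

Minimizing such a term pulls vertices that the model marks as contacts toward the object surface, which is one common way to enforce contact consistency during refinement.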
Related papers
- UniHM: Unified Dexterous Hand Manipulation with Vision Language Model [39.2419824041854]
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands.
arXiv Detail & Related papers (2026-02-28T16:37:11Z) - CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion [62.93198247045824]
3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. We propose CoopDiff, a novel contact-consistent decoupled diffusion framework that employs two distinct branches to decouple human and object motion modeling.
arXiv Detail & Related papers (2025-08-10T03:29:17Z) - HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z) - FunHOI: Annotation-Free 3D Hand-Object Interaction Generation via Functional Text Guidance [9.630837159704004]
Hand-object interaction (HOI) is the fundamental link between humans and their environment. Despite advances in AI and robotics, capturing the semantics of functional grasping tasks remains a considerable challenge. We propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text.
arXiv Detail & Related papers (2025-02-28T07:42:54Z) - ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models [17.438429495623755]
ClickDiff is a controllable conditional generation model that leverages a fine-grained Semantic Contact Map.
Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information.
Experiments demonstrate the efficacy and robustness of ClickDiff, even on previously unseen objects.
arXiv Detail & Related papers (2024-07-28T02:42:29Z) - G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis [57.07638884476174]
G-HOP is a denoising-diffusion-based generative prior for hand-object interactions.
We represent the human hand via a skeletal distance field to obtain a representation aligned with the signed distance field for the object.
We show that this hand-object prior can then serve as generic guidance to facilitate other tasks such as reconstruction from interaction clips and human grasp synthesis.
arXiv Detail & Related papers (2024-04-18T17:59:28Z) - Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction [8.253265795150401]
This paper introduces the first text-guided work for generating sequences of hand-object interactions in 3D.
For contact generation, a VAE-based network takes as input a text and an object mesh, and generates the probability of contacts between the surfaces of hands and the object.
For motion generation, a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion.
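Several of the papers above sample hand poses or motion by reverse diffusion conditioned on a contact map. As a purely illustrative sketch of the mechanism (a standard DDPM reverse step, not any of these papers' code; the schedule and shapes are assumptions):

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_pred, betas, rng):
    """One reverse step x_t -> x_{t-1} of a standard DDPM sampler.

    x_t:      (D,) current noisy pose vector
    t:        int, current timestep index
    eps_pred: (D,) noise predicted by the (contact-conditioned) network
    betas:    (T,) noise schedule
    """
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)
    # Posterior mean: remove the predicted noise, then rescale.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step is deterministic
    # Earlier steps add fresh Gaussian noise scaled by the schedule.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

In these methods the conditioning (text, contact map) enters only through `eps_pred`; the sampler itself is the generic part.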
arXiv Detail & Related papers (2024-03-31T04:56:30Z) - DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions [15.417836855005087]
We propose a novel method, dubbed DiffH2O, which can synthesize realistic one- or two-handed object interactions. The method introduces three techniques that enable effective learning from limited data.
arXiv Detail & Related papers (2024-03-26T16:06:42Z) - Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z) - GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.