UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
- URL: http://arxiv.org/abs/2603.00732v1
- Date: Sat, 28 Feb 2026 16:37:11 GMT
- Title: UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
- Authors: Zhenhao Zhang, Jiaxin Liu, Ye Shi, Jingya Wang,
- Abstract summary: Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands.
- Score: 39.2419824041854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, forgoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving generalization across dexterous hands and scalability to new morphologies. Our vision-language-action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.
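The shared-codebook idea at the core of the Unified Hand-Dexterous Tokenizer can be illustrated with a small vector-quantization sketch. The snippet below is not the authors' code: the class name SharedCodebookTokenizer, the per-morphology linear encoders and decoders, the codebook size, and the straight-through estimator are assumptions used only to show how joint trajectories from hands with different joint counts could be mapped to tokens drawn from a single codebook.

```python
# Minimal sketch (assumed design, not UniHM's released implementation) of a
# tokenizer that maps heterogeneous hand morphologies into one shared codebook.
import torch
import torch.nn as nn


class SharedCodebookTokenizer(nn.Module):
    def __init__(self, joint_dims: dict, code_dim: int = 64, num_codes: int = 512):
        super().__init__()
        # One lightweight encoder/decoder per hand morphology; the codebook is shared.
        self.encoders = nn.ModuleDict({m: nn.Linear(d, code_dim) for m, d in joint_dims.items()})
        self.decoders = nn.ModuleDict({m: nn.Linear(code_dim, d) for m, d in joint_dims.items()})
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, joints: torch.Tensor, morphology: str):
        z = self.encoders[morphology](joints)                              # (B, T, code_dim)
        # Nearest codebook entry per frame (squared Euclidean distance).
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)    # (B, T, num_codes)
        tokens = dists.argmin(dim=-1)                                      # (B, T) discrete hand tokens
        z_q = self.codebook(tokens)
        z_q = z + (z_q - z).detach()                                       # straight-through gradient
        recon = self.decoders[morphology](z_q)                             # back to this hand's joint space
        return tokens, recon


if __name__ == "__main__":
    # Two hypothetical morphologies with different joint counts share one token space.
    tok = SharedCodebookTokenizer({"allegro": 16, "shadow": 24})
    traj = torch.randn(2, 30, 24)          # a batch of 30-frame Shadow-hand joint trajectories
    tokens, recon = tok(traj, "shadow")
    print(tokens.shape, recon.shape)       # torch.Size([2, 30]) torch.Size([2, 30, 24])
```

Under this assumed design, the discrete tokens would form the output vocabulary of the vision-language-action model, and each morphology's decoder would turn predicted tokens back into executable joint targets for that hand.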
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel recurrent policy that combines geometric priors with spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. To address the data-efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z)
- SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation [20.50790587356819]
This paper proposes SynHLMA, a novel HAOI sequence generation framework. We use a discrete HAOI representation to model each hand-object interaction frame. Along with natural-language embeddings, the representations are trained by an HAOI manipulation language model. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints.
arXiv Detail & Related papers (2025-10-29T08:27:00Z)
- DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation [25.208854363099352]
This dataset contains 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation.
arXiv Detail & Related papers (2025-10-17T16:08:14Z)
- TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions [66.08264566003048]
Free-Form HOI Generation aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent. We construct WildO2, an in-the-wild 3D HOI dataset that includes diverse HOI derived from internet videos. Building on this dataset, we propose TOUCH, a three-stage framework that facilitates fine-grained semantic control to generate versatile hand poses.
arXiv Detail & Related papers (2025-10-16T16:52:58Z)
- OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model [22.545267010077822]
We introduce OpenHOI, the first framework for open-world HOI synthesis. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage.
arXiv Detail & Related papers (2025-05-25T02:48:43Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI achieves a remarkable improvement of 10% to 64% over previous state-of-the-art methods on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially with a low-level controller that combines primitive dynamics and inverse kinematics control (a minimal structural sketch follows this entry).
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
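The planner/controller decomposition summarized in the entry above can be sketched structurally. The snippet below is a hypothetical skeleton, not the paper's implementation: the primitive names, poses, and the placeholder inverse-kinematics routine are assumptions, and the recurrent graph neural network that parameterizes each primitive in the paper is omitted.

```python
# Hypothetical skeleton of a high-level planner composing movement primitives and a
# low-level controller executing them; all names and the placeholder IK are assumptions.
from dataclasses import dataclass
import numpy as np


@dataclass
class Primitive:
    name: str
    target_pose: np.ndarray   # desired end-effector position (x, y, z)


def plan(task: str) -> list:
    """High-level planner: compose primitives sequentially for the given task."""
    return [
        Primitive("reach", np.array([0.40, 0.10, 0.30])),
        Primitive("grasp", np.array([0.40, 0.10, 0.15])),
        Primitive("lift", np.array([0.40, 0.10, 0.40])),
    ]


def inverse_kinematics(pose: np.ndarray) -> np.ndarray:
    """Placeholder IK: a real controller would solve for joint angles here."""
    return np.tanh(pose)


def execute(primitives) -> None:
    """Low-level controller: track each primitive's target with IK-based commands."""
    for p in primitives:
        q = inverse_kinematics(p.target_pose)
        print(f"{p.name}: joint command {np.round(q, 3)}")


if __name__ == "__main__":
    execute(plan("pick up the cup"))
```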
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.