UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
- URL: http://arxiv.org/abs/2603.00732v1
- Date: Sat, 28 Feb 2026 16:37:11 GMT
- Title: UniHM: Unified Dexterous Hand Manipulation with Vision Language Model
- Authors: Zhenhao Zhang, Jiaxin Liu, Ye Shi, Jingya Wang,
- Abstract summary: Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands.
- Score: 39.2419824041854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, forgoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving generalization across dexterous hands and scalability to new morphologies. Our vision-language-action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.
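The shared-codebook idea at the core of the Unified Hand-Dexterous Tokenizer can be illustrated with a small vector-quantization sketch. The snippet below is not the authors' code: the class name SharedCodebookTokenizer, the per-morphology linear encoders and decoders, the codebook size, and the straight-through estimator are assumptions used only to show how joint trajectories from hands with different joint counts could be mapped to tokens drawn from a single codebook.

```python
# Minimal sketch (assumed design, not UniHM's released implementation) of a
# tokenizer that maps heterogeneous hand morphologies into one shared codebook.
import torch
import torch.nn as nn


class SharedCodebookTokenizer(nn.Module):
    def __init__(self, joint_dims: dict, code_dim: int = 64, num_codes: int = 512):
        super().__init__()
        # One lightweight encoder/decoder per hand morphology; the codebook is shared.
        self.encoders = nn.ModuleDict({m: nn.Linear(d, code_dim) for m, d in joint_dims.items()})
        self.decoders = nn.ModuleDict({m: nn.Linear(code_dim, d) for m, d in joint_dims.items()})
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, joints: torch.Tensor, morphology: str):
        z = self.encoders[morphology](joints)                              # (B, T, code_dim)
        # Nearest codebook entry per frame (squared Euclidean distance).
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)    # (B, T, num_codes)
        tokens = dists.argmin(dim=-1)                                      # (B, T) discrete hand tokens
        z_q = self.codebook(tokens)
        z_q = z + (z_q - z).detach()                                       # straight-through gradient
        recon = self.decoders[morphology](z_q)                             # back to this hand's joint space
        return tokens, recon


if __name__ == "__main__":
    # Two hypothetical morphologies with different joint counts share one token space.
    tok = SharedCodebookTokenizer({"allegro": 16, "shadow": 24})
    traj = torch.randn(2, 30, 24)          # a batch of 30-frame Shadow-hand joint trajectories
    tokens, recon = tok(traj, "shadow")
    print(tokens.shape, recon.shape)       # torch.Size([2, 30]) torch.Size([2, 30, 24])
```

Under this assumed design, the discrete tokens would form the output vocabulary of the vision-language-action model, and each morphology's decoder would turn predicted tokens back into executable joint targets for that hand.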
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel recurrent policy that combines geometric priors with spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. To address the data-efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z)
- SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation [20.50790587356819]
This paper proposes SynHLMA, a novel HAOI sequence generation framework. We use a discrete HAOI representation to model each hand-object interaction frame. Along with natural-language embeddings, the representations are trained by an HAOI manipulation language model. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints.
arXiv Detail & Related papers (2025-10-29T08:27:00Z)
- DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation [25.208854363099352]
This dataset contains 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation.
arXiv Detail & Related papers (2025-10-17T16:08:14Z)
- TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions [66.08264566003048]
Free-Form HOI Generation aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent. We construct WildO2, an in-the-wild 3D HOI dataset that includes diverse HOI derived from internet videos. Building on this dataset, we propose TOUCH, a three-stage framework that facilitates fine-grained semantic control to generate versatile hand poses.
arXiv Detail & Related papers (2025-10-16T16:52:58Z)
- OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model [22.545267010077822]
We introduce OpenHOI, the first framework for open-world HOI synthesis. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage.
arXiv Detail & Related papers (2025-05-25T02:48:43Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI achieves a remarkable improvement of 10% to 64% over previous state-of-the-art methods on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially with a low-level controller that combines primitive dynamics and inverse kinematics control (a minimal structural sketch follows this entry).
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
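The planner/controller decomposition summarized in the entry above can be sketched structurally. The snippet below is a hypothetical skeleton, not the paper's implementation: the primitive names, poses, and the placeholder inverse-kinematics routine are assumptions, and the recurrent graph neural network that parameterizes each primitive in the paper is omitted.

```python
# Hypothetical skeleton of a high-level planner composing movement primitives and a
# low-level controller executing them; all names and the placeholder IK are assumptions.
from dataclasses import dataclass
import numpy as np


@dataclass
class Primitive:
    name: str
    target_pose: np.ndarray   # desired end-effector position (x, y, z)


def plan(task: str) -> list:
    """High-level planner: compose primitives sequentially for the given task."""
    return [
        Primitive("reach", np.array([0.40, 0.10, 0.30])),
        Primitive("grasp", np.array([0.40, 0.10, 0.15])),
        Primitive("lift", np.array([0.40, 0.10, 0.40])),
    ]


def inverse_kinematics(pose: np.ndarray) -> np.ndarray:
    """Placeholder IK: a real controller would solve for joint angles here."""
    return np.tanh(pose)


def execute(primitives) -> None:
    """Low-level controller: track each primitive's target with IK-based commands."""
    for p in primitives:
        q = inverse_kinematics(p.target_pose)
        print(f"{p.name}: joint command {np.round(q, 3)}")


if __name__ == "__main__":
    execute(plan("pick up the cup"))
```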
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.