UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
- URL: http://arxiv.org/abs/2602.21631v1
- Date: Wed, 25 Feb 2026 06:53:15 GMT
- Title: UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
- Authors: Zhihao Sun, Tong Wu, Ruirui Tu, Daoguo Dong, Zuxuan Wu
- Abstract summary: UniHand is a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. Visual observations are encoded with a frozen backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features. A latent diffusion model then synthesizes consistent motion sequences from diverse conditions.
- Score: 45.29560152294065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
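The abstract outlines a concrete pipeline: a joint VAE that maps heterogeneous structured conditions (e.g., MANO parameters, 2D skeletons) into one shared latent space, a frozen vision backbone whose features a hand-specific module queries directly, and a latent diffusion model over motion sequences. The sketch below is a minimal PyTorch rendering of that flow; all module names, dimensions (the 61-dim MANO vector, the 21-joint skeleton, the latent width), and architectural details are illustrative assumptions, not the authors' implementation.
```python
# Hedged sketch of a UniHand-style conditional latent diffusion pipeline.
# All names, shapes, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

D = 256  # shared latent width (assumed)

class JointConditionVAE(nn.Module):
    """Embeds heterogeneous structured signals (e.g., MANO params, 2D skeletons)
    into one shared latent space so the denoiser sees a single condition format.
    Decoder omitted for brevity."""
    def __init__(self, mano_dim=61, skel2d_dim=21 * 2):
        super().__init__()
        self.enc_mano = nn.Linear(mano_dim, 2 * D)   # predicts mean and log-variance
        self.enc_skel = nn.Linear(skel2d_dim, 2 * D)

    def encode(self, x, kind):
        h = self.enc_mano(x) if kind == "mano" else self.enc_skel(x)
        mu, logvar = h.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

class HandPerceptron(nn.Module):
    """Cross-attends learned queries against frozen backbone features to pull
    out hand-specific tokens, avoiding a separate detect-and-crop stage."""
    def __init__(self, feat_dim=768, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, D))
        self.proj = nn.Linear(feat_dim, D)
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, img_feats):                    # (B, N_patches, feat_dim)
        kv = self.proj(img_feats)
        q = self.queries.expand(img_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out                                   # (B, n_queries, D)

class MotionDenoiser(nn.Module):
    """Transformer that predicts the noise on a latent motion sequence, given
    the diffusion timestep and the concatenated condition tokens."""
    def __init__(self, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.t_embed = nn.Embedding(1000, D)

    def forward(self, z_t, t, cond_tokens):          # z_t: (B, T_frames, D)
        seq = torch.cat([self.t_embed(t)[:, None], cond_tokens, z_t], dim=1)
        return self.body(seq)[:, -z_t.size(1):]      # noise estimate per frame
```
Under this formulation, estimation and generation differ only in which condition tokens happen to be available to the denoiser, which is the unification the abstract describes.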
Related papers
- FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion [49.026972478098266]
Hands are central to interacting with our surroundings and conveying gestures. Existing human motion synthesis methods fall short. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion.
arXiv Detail & Related papers (2026-01-07T14:18:59Z)
- CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects [14.230098033626744]
Whole-body manipulation of articulated objects is a critical yet challenging task with broad applications in virtual humans and robotics. We propose a novel coordinated diffusion noise optimization framework to achieve realistic whole-body motion; a hedged sketch of noise optimization in this spirit follows this entry. Extensive experiments demonstrate that the method outperforms existing approaches in motion quality and physical plausibility.
arXiv Detail & Related papers (2025-05-27T17:11:50Z)
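The CoDA summary names coordinated diffusion noise optimization without detailing the mechanics. One standard way to realize noise optimization, sketched below under assumptions (a toy denoiser interface and a deterministic DDIM-style rollout so gradients can reach the noise), is to freeze the diffusion model and optimize the initial noise against a task loss. This illustrates the general technique, not CoDA's specific coordinated scheme.
```python
# Hypothetical sketch of diffusion noise optimization: keep the denoiser frozen
# and optimize the starting noise so the sampled motion minimizes a task loss
# (e.g., contact or penetration penalties). Schedule and names are assumptions.
import torch

def ddim_rollout(denoiser, z, steps=20):
    """Deterministic DDIM-style rollout (toy schedule) so gradients flow to z."""
    alpha_bar = torch.linspace(0.01, 0.99, steps)    # noisy -> clean
    for i in range(steps - 1):
        t = torch.full((z.size(0),), steps - 1 - i, dtype=torch.long)
        eps = denoiser(z, t)                         # predicted noise
        z0_hat = (z - (1 - alpha_bar[i]).sqrt() * eps) / alpha_bar[i].sqrt()
        z = alpha_bar[i + 1].sqrt() * z0_hat + (1 - alpha_bar[i + 1]).sqrt() * eps
    return z

def optimize_noise(denoiser, task_loss, shape, iters=50, lr=0.05):
    """Treat the initial noise as the optimization variable; the denoiser stays frozen."""
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(iters):
        motion = ddim_rollout(denoiser, noise)
        loss = task_loss(motion)          # e.g., a hand-object contact objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ddim_rollout(denoiser, noise.detach())
```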
- Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose MoMADiff, a robust motion generation framework that generates 3D human motion from text descriptions. Our model supports flexible user-provided specifications, enabling precise control over both the spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence to user-provided specifications.
arXiv Detail & Related papers (2025-05-16T09:06:15Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals; a hedged sketch of this constraint follows this entry. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
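GENMO's key insight, recasting estimation as generation that must satisfy observed signals, resembles inpainting-style conditioning in diffusion models: at each reverse step, frames with observations are replaced by a forward-noised copy of the observation, forcing the trajectory to agree with them wherever they exist. A minimal sketch under toy notation (not GENMO's actual code):
```python
import torch

def constrained_reverse_step(denoise_step, x_t, t, observed, mask, alpha_bar):
    """One reverse-diffusion step with observed frames clamped (inpainting-style).
    observed: (B, T, D) ground-truth motion where available; mask: (B, T, 1),
    1 = frame is observed. alpha_bar: 1-D cumulative noise schedule (assumed)."""
    x_prev = denoise_step(x_t, t)                 # model's unconstrained proposal
    noise = torch.randn_like(observed)
    obs_t = alpha_bar[t].sqrt() * observed + (1 - alpha_bar[t]).sqrt() * noise
    return mask * obs_t + (1 - mask) * x_prev     # keep proposal only where unobserved
```
Estimation then falls out as the fully-masked case, while free generation is the fully-unmasked case.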
- Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with alignment, often producing misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
- Gaze-guided Hand-Object Interaction Synthesis: Dataset and Method [61.19028558470065]
We present GazeHOI, the first dataset to capture simultaneous 3D modeling of gaze, hand, and object interactions. We propose a stacked gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. We also introduce HOI-Manifold Guidance during the sampling stage of GHO-Diffusion, enabling fine-grained control over generated motions; a hedged sketch of sampling-time guidance follows this entry.
arXiv Detail & Related papers (2024-03-24T14:24:13Z)
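The summary mentions HOI-Manifold Guidance applied during sampling. A common pattern for such guidance (assumed here, not the paper's exact scheme) is a classifier-guidance-style update that nudges each denoising step down the gradient of a differentiable energy measuring distance from the desired manifold:
```python
import torch

def guided_sample_step(denoise_step, energy_fn, x_t, t, scale=0.1):
    """Classifier-guidance-style step: move the denoised sample downhill on a
    guidance energy. energy_fn is a hypothetical differentiable score of how
    far a motion lies from the desired (e.g., gaze-consistent) manifold."""
    x = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
    x_next = denoise_step(x_t, t)          # ordinary reverse step
    return x_next - scale * grad           # guidance nudge toward the manifold
```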
- GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
- Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement [42.98335775548796]
We introduce a novel two-stage 3D hand generation method based on bilateral hand disentanglement.
In the first stage, natural hand gestures are generated by two hand-disentanglement branches.
The second stage is built on the insight that 3D hand predictions should be non-deterministic.
arXiv Detail & Related papers (2023-03-03T08:08:04Z)
- A Non-Anatomical Graph Structure for isolated hand gesture separation in continuous gesture sequences [42.20687552354674]
Considering the breakthroughs of GCN models for the skeleton modality, we propose a two-layer GCN model to empower the 3D hand skeleton features, and combine it with stacked Bi-LSTM and Attention modules to capture the temporal information in the video stream; a hedged sketch of such a pipeline follows this entry.
arXiv Detail & Related papers (2022-07-15T17:28:52Z)
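As a rough illustration of the described stack (two GCN layers over the 3D hand skeleton per frame, stacked Bi-LSTMs, and attention over time), here is a hedged PyTorch sketch. The 21-joint layout, hidden sizes, the normalized adjacency, and the window-level gesture/transition head are all assumptions, not the paper's architecture.
```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: aggregate neighbor features with a (pre-normalized)
    adjacency, then apply a linear map and ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):               # x: (B, T, J, in_dim), adj: (J, J)
        return torch.relu(self.lin(torch.einsum("jk,btkd->btjd", adj, x)))

class GestureSeparator(nn.Module):
    """Two GCN layers per frame, then stacked Bi-LSTMs plus attention pooling
    over time; here it scores a window as gesture vs. transition (assumed use)."""
    def __init__(self, joints=21, coord=3, hidden=128, classes=2):
        super().__init__()
        self.gcn1 = GCNLayer(coord, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.lstm = nn.LSTM(joints * hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, classes)

    def forward(self, skel, adj):             # skel: (B, T, J, 3)
        h = self.gcn2(self.gcn1(skel, adj), adj)
        h = h.flatten(2)                       # (B, T, J*hidden)
        seq, _ = self.lstm(h)
        w = torch.softmax(self.attn(seq), dim=1)   # temporal attention weights
        return self.head((w * seq).sum(dim=1))     # window-level logits
```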
- Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from single monocular RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.