Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation
- URL: http://arxiv.org/abs/2510.00527v1
- Date: Wed, 01 Oct 2025 05:19:15 GMT
- Title: Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation
- Authors: Taeyun Woo, Jinah Park, Tae-Kyun Kim
- Abstract summary: We propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
- Score: 11.992963268744438
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deterministic models for 3D hand pose reconstruction, whether single-staged or cascaded, struggle with pose ambiguities caused by self-occlusions and complex hand articulations. Existing cascaded approaches refine predictions in a coarse-to-fine manner but remain deterministic and cannot capture pose uncertainties. Recent probabilistic methods model pose distributions yet are restricted to single-stage estimation, which often fails to produce accurate 3D reconstructions without refinement. To address these limitations, we propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. The first stage is a joint diffusion model that samples diverse 3D joint hypotheses, and the second stage is a Mesh Latent Diffusion Model (Mesh LDM) that reconstructs a 3D hand mesh conditioned on a joint sample. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships and robust hand priors. Furthermore, the cascaded design mitigates the difficulty of directly mapping 2D images to dense 3D poses, enhancing accuracy through sequential refinement. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
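The two-stage sampling described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the 21-joint skeleton, the latent size, the step count, the linear noise schedule, and every function name below are assumptions made for illustration, and `toy_noise_predictor` is a stand-in for the trained, image-conditioned denoising networks. Stage 1 draws several joint hypotheses with a standard DDPM reverse process; stage 2 runs a second reverse process in a latent space, conditioned on one joint hypothesis at a time.

```python
import math
import random

NUM_JOINTS = 21   # assumed hand-skeleton size (not stated in the abstract)
LATENT_DIM = 8    # hypothetical mesh-latent size
NUM_STEPS = 50    # hypothetical number of reverse diffusion steps


def ddpm_sample(dim, predict_noise, rng, num_steps=NUM_STEPS):
    """Generic DDPM ancestral sampler with a linear beta schedule."""
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * ei) / math.sqrt(1.0 - betas[t]) for xi, ei in zip(x, eps)]
        if t > 0:  # no noise is added at the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def toy_noise_predictor(target):
    """Stand-in for a trained network: the exact noise if `target` were the clean sample."""
    def predict(x, alpha_bar):
        s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(xi - s * ti) / n for xi, ti in zip(x, target)]
    return predict


def sample_joint_hypotheses(candidate_poses, k, rng):
    """Stage 1: draw k joint hypotheses, each a flat list of 3 * NUM_JOINTS coordinates."""
    hypotheses = []
    for _ in range(k):
        # Toy stand-in for image ambiguity: each run is steered to a random candidate pose.
        target = rng.choice(candidate_poses)
        hypotheses.append(ddpm_sample(3 * NUM_JOINTS, toy_noise_predictor(target), rng))
    return hypotheses


def sample_mesh_latent(joints, rng):
    """Stage 2: sample a mesh latent conditioned on one joint hypothesis."""
    # Toy conditioning: the target latent is a chunked average of the joint coordinates.
    chunk = len(joints) // LATENT_DIM
    target = [sum(joints[i * chunk:(i + 1) * chunk]) / chunk for i in range(LATENT_DIM)]
    return ddpm_sample(LATENT_DIM, toy_noise_predictor(target), rng)


rng = random.Random(0)
pose_a = [0.1] * (3 * NUM_JOINTS)    # two made-up poses that "explain" the same image
pose_b = [-0.1] * (3 * NUM_JOINTS)
hypotheses = sample_joint_hypotheses([pose_a, pose_b], k=4, rng=rng)
latents = [sample_mesh_latent(joints, rng) for joints in hypotheses]
print(len(hypotheses), len(hypotheses[0]), len(latents[0]))  # 4 63 8
```

In a full system a trained decoder would map each latent to a hand mesh; that step is omitted here because the abstract gives no details about it.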
Related papers
- LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion [29.608043710963162]
We tackle the problem of Human Mesh Recovery from a single RGB image. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches have regressed a single deterministic output. We propose a novel approach that models a distribution well aligned with the 2D observations.
arXiv Detail & Related papers (2025-09-30T03:50:56Z) - Learning Correlation-aware Aleatoric Uncertainty for 3D Hand Pose Estimation [29.05126213133674]
We introduce aleatoric uncertainty modeling into the 3D hand pose estimation framework. We propose a novel parameterization that leverages a single linear layer to capture intrinsic correlations among hand joints. Our experiments demonstrate that our parameterization for uncertainty modeling outperforms existing approaches.
arXiv Detail & Related papers (2025-09-01T08:31:01Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud [60.47544798202017]
Hand pose estimation is a critical task in various human-computer interaction applications.
This paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises an accurate hand pose conditioned on hand-shaped image-point clouds.
Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets.
arXiv Detail & Related papers (2024-04-04T02:15:16Z) - InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion [53.90516061351706]
We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction.
For sampling, we combine anti-penetration and synthesis-free guidance to enable plausible generation.
Our method significantly outperforms baseline generative models in terms of plausibility and diversity.
arXiv Detail & Related papers (2024-03-26T06:35:55Z) - D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction [74.49121940466675]
We introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction.
First, to prevent the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm.
Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
arXiv Detail & Related papers (2023-11-23T20:14:50Z) - A Probabilistic Attention Model with Occlusion-aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image [5.725477071353354]
Deep learning approaches have shown promising results in 3D hand reconstruction from a single RGB image.
We propose a novel probabilistic model to achieve the robustness of model-based approaches.
We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios.
arXiv Detail & Related papers (2023-04-27T16:02:32Z) - DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion models [5.908471365011943]
We propose DiffPose, a conditional diffusion model that predicts multiple hypotheses for a given input image.
We show that DiffPose slightly improves upon the state of the art for multi-hypothesis pose estimation for simple poses and outperforms it by a large margin for highly ambiguous poses.
arXiv Detail & Related papers (2022-11-29T18:55:13Z) - HandFlow: Quantifying View-Dependent 3D Ambiguity in Two-Hand Reconstruction with Normalizing Flow [73.7895717883622]
We explicitly model the distribution of plausible reconstructions in a conditional normalizing flow framework.
We show that explicit ambiguity modeling is better-suited for this challenging problem.
arXiv Detail & Related papers (2022-10-04T15:42:22Z) - Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.