Related papers: DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views

DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views

URL: http://arxiv.org/abs/2601.15516v2
Date: Mon, 26 Jan 2026 18:45:41 GMT
Title: DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
Authors: William Huang, Siyou Pei, Leyi Zou, Eric J. Gonzalez, Ishan Chatterjee, Yang Zhang,
Abstract summary: We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position.<n>Our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios.
Score: 11.698905867819837
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >= 50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.

Related papers

HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation [15.606904161622017]
This paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, for hand pose estimation. It incorporates a novel attention mechanism to adaptively weigh the contribution of kinematic correspondence and local geometric features for the estimation of specific keypoints. Experimental results show that the proposed model significantly outperforms the existing methods on four challenging hand pose benchmark datasets.
arXiv Detail & Related papers (2024-07-30T04:53:35Z)
HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud [60.47544798202017]
Hand pose estimation is a critical task in various human-computer interaction applications. This paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds. Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets.
arXiv Detail & Related papers (2024-04-04T02:15:16Z)
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction [74.49121940466675]
We introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction. First, to avoid the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
arXiv Detail & Related papers (2023-11-23T20:14:50Z)
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
HandNeRF: Neural Radiance Fields for Animatable Interacting Hands [122.32855646927013]
We propose a novel framework to reconstruct accurate appearance and geometry with neural radiance fields (NeRF) for interacting hands. We conduct extensive experiments to verify the merits of our proposed HandNeRF and report a series of state-of-the-art results.
arXiv Detail & Related papers (2023-03-24T06:19:19Z)
Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation [59.3035531612715]
Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred. In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame. We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image.
arXiv Detail & Related papers (2023-03-09T02:24:30Z)
Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation [33.86986028882488]
Occlusion poses a great threat to monocular multi-person 3D human pose estimation due to large variability in terms of the shape, appearance, and position of occluders. Existing methods try to handle occlusion with pose priors/constraints, data augmentation, or implicit reasoning. We develop a method to explicitly model this process that significantly improves bottom-up multi-person human pose estimation.
arXiv Detail & Related papers (2022-07-29T22:12:50Z)
3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal [85.30756038989057]
Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions. We propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately. Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.
arXiv Detail & Related papers (2022-07-22T13:04:06Z)
Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera [79.41374930171469]
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach combines an extensive list of favorable properties, namely it is marker-less. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work.
arXiv Detail & Related papers (2021-06-15T11:39:49Z)
Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation [76.12736932610163]
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model. This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist).
arXiv Detail & Related papers (2020-07-06T21:24:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.