Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation
- URL: http://arxiv.org/abs/2303.04991v2
- Date: Fri, 18 Aug 2023 01:20:35 GMT
- Title: Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation
- Authors: Qichen Fu, Xingyu Liu, Ran Xu, Juan Carlos Niebles, Kris M. Kitani
- Abstract summary: Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred.
In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame.
We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image and across different timesteps.
- Score: 59.3035531612715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately estimating 3D hand pose is crucial for understanding how humans
interact with the world. Despite remarkable progress, existing methods often
struggle to generate plausible hand poses when the hand is heavily occluded or
blurred. In videos, the movements of the hand allow us to observe various parts
of the hand that may be occluded or blurred in a single frame. To adaptively
leverage the visual clues before and after the occlusion or blurring for robust
hand pose estimation, we propose the Deformer: a framework that implicitly
reasons about the relationship between hand parts within the same image
(spatial dimension) and different timesteps (temporal dimension). We show that
a naive application of the transformer self-attention mechanism is not
sufficient because motion blur or occlusions in certain frames can lead to
heavily distorted hand features and generate imprecise keys and queries. To
address this challenge, we incorporate a Dynamic Fusion Module into Deformer,
which predicts the deformation of the hand and warps the hand mesh predictions
from nearby frames to explicitly support the current frame estimation.
Furthermore, we have observed that errors are unevenly distributed across
different hand parts, with vertices around fingertips having disproportionately
higher errors than those around the palm. We mitigate this issue by introducing
a new loss function called maxMSE that automatically adjusts the weight of
every vertex to focus the model on critical hand parts. Extensive experiments
show that our method significantly outperforms state-of-the-art methods by 10%,
and is more robust to occlusions (over 14%).
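
The Dynamic Fusion Module described above predicts the hand's deformation and warps mesh predictions from nearby frames to support the current frame. Below is a minimal sketch of that warp-and-fuse idea; all function and tensor names are illustrative, not the paper's actual interface:

```python
import torch

def warp_neighbor_meshes(neighbor_verts, pred_flow, conf):
    """Warp hand mesh predictions from nearby frames toward the current frame
    and fuse them by per-frame confidence (illustrative sketch).

    neighbor_verts: (T, V, 3) mesh vertices predicted for T nearby frames
    pred_flow:      (T, V, 3) predicted per-vertex deformation to the current frame
    conf:           (T, 1, 1) per-frame confidence logits
    """
    warped = neighbor_verts + pred_flow      # apply the predicted deformation
    weights = torch.softmax(conf, dim=0)     # normalize confidences across frames
    return (weights * warped).sum(dim=0)     # (V, 3) fused current-frame estimate
```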
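The maxMSE loss is described as automatically re-weighting every vertex so that high-error parts such as fingertips dominate the objective. One plausible reading, sketched below with hypothetical details (the paper's exact weighting scheme may differ):

```python
import torch

def max_mse(pred, target, eps=1e-8):
    """maxMSE-style loss sketch: each vertex's squared error is re-weighted by
    its own (detached) relative magnitude, so high-error vertices dominate.

    pred, target: (V, 3) predicted and ground-truth mesh vertices.
    """
    err = ((pred - target) ** 2).sum(dim=-1)         # (V,) per-vertex squared error
    w = err.detach() / (err.detach().sum() + eps)    # larger error -> larger weight
    return (w * err).sum()
```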
Related papers
- Two Hands Are Better Than One: Resolving Hand to Hand Intersections via Occupancy Networks [33.9893684177763]
Self-occlusions and finger articulation pose a significant challenge for hand pose estimation.
We exploit an occupancy network that represents the hand's volume as a continuous manifold.
We design an intersection loss function to minimize the likelihood of hand-to-point intersections.
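
A hedged sketch of what such an occupancy-based intersection penalty could look like: surface points of one hand are queried against the other hand's occupancy field, and points predicted to lie inside it are penalized. The `occ_net` interface and the 0.5 inside/outside threshold are assumptions, not the paper's actual design:

```python
import torch

def intersection_loss(occ_net, latent_a, points_b):
    """Occupancy-based intersection penalty sketch (hypothetical interface).

    occ_net(latent, points) -> (N,) occupancy probabilities in [0, 1]
    latent_a: latent code describing hand A's volume
    points_b: (N, 3) surface points sampled from hand B's mesh
    """
    occ = occ_net(latent_a, points_b)        # occupancy of hand B's points in hand A
    return torch.relu(occ - 0.5).mean()      # penalize only points predicted inside
```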
arXiv Detail & Related papers (2024-04-08T11:32:26Z)
- HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud [60.47544798202017]
Hand pose estimation is a critical task in various human-computer interaction applications.
This paper proposes HandDiff, a diffusion-based hand pose estimation model that recovers an accurate hand pose by iterative denoising, conditioned on hand-shaped image-point clouds.
Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets.
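
As a rough illustration of the iterative denoising idea, here is a generic DDPM-style ancestral sampler over joint coordinates; this is not HandDiff's actual formulation, and `model`, `cond_feats`, and the noise schedule are assumptions:

```python
import torch

@torch.no_grad()
def denoise_pose(model, cond_feats, betas, num_joints=21):
    """Generic DDPM-style reverse process over 3D joint coordinates.

    model(pose, cond_feats, t) -> predicted noise, same shape as pose
    betas: (T,) noise schedule as a 1-D tensor
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    pose = torch.randn(num_joints, 3)                  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(pose, cond_feats, t)               # predict the noise component
        pose = (pose - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                      # add noise except at the last step
            pose = pose + torch.sqrt(betas[t]) * torch.randn_like(pose)
    return pose
```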
arXiv Detail & Related papers (2024-04-04T02:15:16Z)
- On the Utility of 3D Hand Poses for Action Recognition [36.64538554919222]
HandFormer is a novel multimodal transformer that efficiently models hand-object interactions.
We factorize hand modeling and represent each joint by its short-term trajectories.
We achieve new state-of-the-art performance on Assembly101 and H2O with significant improvements in egocentric action recognition.
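
A small sketch of what "representing each joint by its short-term trajectories" might look like as a preprocessing step; the window size and tensor layout are assumptions, not HandFormer's actual design:

```python
import torch

def joint_trajectories(joints, window=5):
    """Turn per-frame joint positions into short-term trajectory tokens.

    joints: (T, J, 3) joint positions over T frames.
    Returns (T - window + 1, J, window * 3): one token per joint, holding its
    trajectory over the last `window` frames.
    """
    T, J, C = joints.shape
    win = joints.unfold(0, window, 1)            # (T - window + 1, J, C, window)
    return win.permute(0, 1, 3, 2).reshape(-1, J, window * C)
```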
arXiv Detail & Related papers (2024-03-14T18:52:34Z)
- HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting [72.95232302438207]
Diffusion models have achieved remarkable success in generating realistic images.
However, they often fail to generate accurate human hands, producing artifacts such as incorrect finger counts or irregular shapes.
This paper introduces a lightweight post-processing solution called HandRefiner.
arXiv Detail & Related papers (2023-11-29T08:52:08Z)
- Denoising Diffusion for 3D Hand Pose Estimation from Images [38.20064386142944]
This paper addresses the problem of 3D hand pose estimation from monocular images or sequences.
We present a novel end-to-end framework for 3D hand regression that employs diffusion models, which have shown an excellent ability to capture the data distribution for generative purposes.
The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D.
arXiv Detail & Related papers (2023-08-18T12:57:22Z)
- 3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal [85.30756038989057]
Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions.
We propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.
Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.
arXiv Detail & Related papers (2022-07-22T13:04:06Z)
- Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from single monocular RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
- Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation [84.28064034301445]
Self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error.
We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
arXiv Detail & Related papers (2021-07-01T13:28:02Z)