Learning to Disambiguate Strongly Interacting Hands via Probabilistic
Per-pixel Part Segmentation
- URL: http://arxiv.org/abs/2107.00434v1
- Date: Thu, 1 Jul 2021 13:28:02 GMT
- Title: Learning to Disambiguate Strongly Interacting Hands via Probabilistic
Per-pixel Part Segmentation
- Authors: Zicong Fan, Adrian Spurr, Muhammed Kocabas, Siyu Tang, Michael J.
Black, Otmar Hilliges
- Abstract summary: Self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error.
We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
- Score: 84.28064034301445
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In natural conversation and interaction, our hands often overlap or are in
contact with each other. Because hands are homogeneous in appearance, this
overlap makes estimating the 3D pose of interacting hands from images difficult. In this
paper we demonstrate that self-similarity, and the resulting ambiguities in
assigning pixel observations to the respective hands and their parts, is a
major cause of the final 3D pose error. Motivated by this insight, we propose
DIGIT, a novel method for estimating the 3D poses of two interacting hands from
a single monocular image. The method consists of two interwoven branches that
process the input imagery into a per-pixel semantic part segmentation mask and
a visual feature volume. In contrast to prior work, we do not decouple the
segmentation from the pose estimation stage, but rather leverage the per-pixel
probabilities directly in the downstream pose estimation task. To do so, the
part probabilities are merged with the visual features and processed via
fully-convolutional layers. We experimentally show that the proposed approach
achieves new state-of-the-art performance on the InterHand2.6M dataset for both
single and interacting hands across all metrics. We provide detailed ablation
studies to demonstrate the efficacy of our method and to provide insights into
how the modelling of pixel ownership affects single and interacting hand pose
estimation. Our code will be released for research purposes.
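The fusion step the abstract describes lends itself to a short sketch. Below is a minimal PyTorch-style illustration of the idea: soft per-pixel part probabilities are concatenated with the visual feature volume and refined by fully-convolutional layers before pose estimation. The module name, channel sizes, part count, and concatenation-based fusion are illustrative assumptions, not the paper's actual architecture.
```python
import torch
import torch.nn as nn

class SegmentationFusionSketch(nn.Module):
    """Hedged sketch of a DIGIT-style fusion step: per-pixel part
    probabilities are merged with visual features and processed by
    fully-convolutional layers. All sizes here are assumptions."""

    def __init__(self, feat_channels=256, num_parts=33, hidden=256):
        # num_parts=33 is an illustrative choice (e.g., 16 parts
        # per hand plus background), not the paper's exact count.
        super().__init__()
        # Segmentation branch (assumed): per-pixel part logits.
        self.seg_head = nn.Conv2d(feat_channels, num_parts, kernel_size=1)
        # Fusion (assumed): fully-convolutional layers over the
        # concatenation of soft part probabilities and visual features.
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_channels + num_parts, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        # feats: (B, C, H, W) visual feature volume from a backbone.
        part_logits = self.seg_head(feats)
        # Keep soft probabilities (no argmax), so segmentation
        # uncertainty propagates into the downstream pose branch.
        part_probs = part_logits.softmax(dim=1)
        fused = self.fuse(torch.cat([feats, part_probs], dim=1))
        return fused, part_logits  # a pose head would consume `fused`
```
The design point the abstract emphasizes is that the pose branch consumes the per-pixel probabilities directly, rather than a hard segmentation decoupled from pose estimation, so ambiguity in pixel ownership is preserved for the downstream task.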
Related papers
- SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition [5.359837526794863]
Hand pose provides key information for action recognition from the egocentric perspective.
We propose to improve egocentric 3D hand pose estimation based on RGB frames only by using pseudo-depth images.
arXiv Detail & Related papers (2024-08-19T14:30:29Z)
- HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning [1.4515751892711464]
We propose an end-to-end solution that addresses the 2D-3D correspondence problem.
This solution enables back-propagation from camera-space outputs to the rest of the network through a new differentiable global positioning module (a hedged sketch of such a module appears after this list).
We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T17:59:01Z)
- 3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal [85.30756038989057]
Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions.
We propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.
Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.
arXiv Detail & Related papers (2022-07-22T13:04:06Z)
- Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from single monocular RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
- H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects.
Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame.
Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
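HandDGP's differentiable global positioning module is only named in the summary above. As a hedged illustration, the sketch below shows one standard way to make global positioning differentiable: recover the camera-space root translation from predicted root-relative 3D joints and their 2D correspondences by solving a linear least-squares system, through which gradients flow back into the network. The function name, pinhole-camera setup, and least-squares formulation are assumptions, not the paper's actual module.
```python
import torch

def solve_global_translation(joints_3d, joints_2d, fx, fy, cx, cy):
    """Illustrative differentiable global positioning (assumed setup).

    joints_3d: (J, 3) root-relative 3D joints predicted by a network.
    joints_2d: (J, 2) corresponding 2D joint locations in pixels.
    fx, fy, cx, cy: pinhole intrinsics. Returns a (3,) translation t
    such that projecting joints_3d + t approximates joints_2d.
    """
    # Normalized image coordinates: a = (u - c) / f.
    ax = (joints_2d[:, 0] - cx) / fx  # (J,)
    ay = (joints_2d[:, 1] - cy) / fy

    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    zeros = torch.zeros_like(X)
    ones = torch.ones_like(X)

    # The projection constraint (X + tx) / (Z + tz) = ax is linear in t:
    #   tx - ax * tz = ax * Z - X   (and likewise for y).
    A = torch.cat([
        torch.stack([ones, zeros, -ax], dim=1),
        torch.stack([zeros, ones, -ay], dim=1),
    ], dim=0)                                                 # (2J, 3)
    b = torch.cat([ax * Z - X, ay * Z - Y], dim=0).unsqueeze(1)  # (2J, 1)

    # torch.linalg.lstsq is differentiable for full-rank A, so camera-space
    # losses can back-propagate to the 3D and 2D predictions.
    return torch.linalg.lstsq(A, b).solution.squeeze(1)      # (3,)
```
For example, with hypothetical intrinsics, `t = solve_global_translation(pred_joints, pred_2d, 1000.0, 1000.0, 128.0, 128.0)` would place the predicted hand in camera space while keeping the whole pipeline trainable end-to-end.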