End-to-end Weakly-supervised Single-stage Multiple 3D Hand Mesh
Reconstruction from a Single RGB Image
- URL: http://arxiv.org/abs/2204.08154v3
- Date: Sat, 6 May 2023 08:38:24 GMT
- Title: End-to-end Weakly-supervised Single-stage Multiple 3D Hand Mesh
Reconstruction from a Single RGB Image
- Authors: Jinwei Ren, Jianke Zhu, and Jialiang Zhang
- Abstract summary: We propose a single-stage pipeline for multi-hand reconstruction.
Specifically, we design a multi-head auto-encoder structure, where each head network shares the same feature map and outputs the hand center, pose and texture.
Our method outperforms state-of-the-art model-based methods in both weakly-supervised and fully-supervised settings.
- Score: 9.238322841389994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the challenging task of simultaneously locating
and recovering multiple hands from a single 2D image. Previous studies either
focus on single hand reconstruction or solve this problem in a multi-stage way.
Moreover, the conventional two-stage pipeline first detects hand areas and
then estimates the 3D hand pose from each cropped patch. To reduce the
computational redundancy in preprocessing and feature extraction, for the first
time, we propose a concise but efficient single-stage pipeline for multi-hand
reconstruction. Specifically, we design a multi-head auto-encoder structure,
where each head network shares the same feature map and outputs the hand
center, pose, and texture, respectively. In addition, we adopt a weakly-supervised
scheme to alleviate the burden of expensive 3D real-world data annotations. To
this end, we propose a series of losses optimized by a stage-wise training
scheme, where a multi-hand dataset with 2D annotations is generated based on
the publicly available single hand datasets. In order to further improve the
accuracy of the weakly supervised model, we adopt several feature consistency
constraints in both single and multiple hand settings. Specifically, the
keypoints of each hand estimated from local features should be consistent with
the re-projected points predicted from global features. Extensive experiments
on public benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD
demonstrate that our method outperforms state-of-the-art model-based
methods in both weakly-supervised and fully-supervised settings. The code and
models are available at https://github.com/zijinxuxu/SMHR.
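
Below is a minimal PyTorch sketch of the two ideas the abstract describes: head networks that share one backbone feature map and output the hand center, pose, and texture, plus an L1 consistency term between locally estimated keypoints and the re-projected points predicted from global features. All layer sizes and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the multi-head idea: every head reads the same backbone
# feature map and emits a per-pixel output (a 1-channel center heatmap plus
# pose and texture parameter maps). Channel sizes here (pose_dim, tex_dim)
# are illustrative assumptions, not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadDecoder(nn.Module):
    def __init__(self, feat_ch=256, pose_dim=61, tex_dim=10):
        super().__init__()
        self.center_head = nn.Conv2d(feat_ch, 1, kernel_size=1)
        self.pose_head = nn.Conv2d(feat_ch, pose_dim, kernel_size=1)
        self.tex_head = nn.Conv2d(feat_ch, tex_dim, kernel_size=1)

    def forward(self, feat):                       # feat: (B, C, H, W)
        center = torch.sigmoid(self.center_head(feat))
        return center, self.pose_head(feat), self.tex_head(feat)

def keypoint_consistency_loss(local_kpts, reprojected_global_kpts):
    """Keypoints estimated from local (per-hand) features should agree with
    the re-projected points predicted from global features."""
    return F.l1_loss(local_kpts, reprojected_global_kpts)

# Smoke test on a dummy feature map.
center, pose, tex = MultiHeadDecoder()(torch.randn(2, 256, 32, 32))
print(center.shape, pose.shape, tex.shape)
```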
Related papers
- WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild.
The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model.
Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
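
As a rough illustration of the two-component design described above, the following hypothetical pipeline localizes hands with a detector and reconstructs each crop with a separate model; both components and their interfaces are placeholders, not WiLoR's actual API.

```python
# Hypothetical sketch of a two-component localize-then-reconstruct pipeline.
# `detector` and `reconstructor` are stand-in callables, not WiLoR's modules.
import torch

def reconstruct_hands(image, detector, reconstructor, crop_size=224):
    """image: (3, H, W); detector returns (N, 4) boxes as x1, y1, x2, y2."""
    boxes = detector(image.unsqueeze(0))
    meshes = []
    for x1, y1, x2, y2 in boxes.round().long():
        crop = image[:, y1:y2, x1:x2]
        crop = torch.nn.functional.interpolate(
            crop.unsqueeze(0), size=(crop_size, crop_size),
            mode="bilinear", align_corners=False)
        meshes.append(reconstructor(crop))   # e.g. hand model parameters
    return meshes
```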
arXiv Detail & Related papers (2024-09-18T18:46:51Z)
- Two Hands Are Better Than One: Resolving Hand to Hand Intersections via Occupancy Networks [33.9893684177763]
Self-occlusions and finger articulation pose a significant challenge to hand pose estimation.
We exploit an occupancy network that represents the hand's volume as a continuous manifold.
We design an intersection loss function to minimize the likelihood of hand-to-hand intersections.
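
A hedged sketch of how such an intersection penalty could look: evaluate one hand's occupancy network at the other hand's surface points and penalize positive occupancy. The occupancy functions here are stand-in callables, not the paper's architecture.

```python
# Assumed form of an occupancy-based intersection penalty: points on hand A
# should have low occupancy under hand B's network, and vice versa.
import torch

def intersection_loss(points_a, points_b, occ_fn_a, occ_fn_b):
    """points_*: (N, 3) surface points; occ_fn_*: 3D point -> inside-probability."""
    return occ_fn_b(points_a).clamp(min=0).mean() + \
           occ_fn_a(points_b).clamp(min=0).mean()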
arXiv Detail & Related papers (2024-04-08T11:32:26Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
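
One simple way to realize joint- and pose-level uncertainty from two output heads, sketched here as an assumption rather than MRP-Net's actual formulation, is to measure the disagreement between the heads' predictions.

```python
# Assumed proxy for prediction uncertainty: per-joint disagreement between
# the two heads, aggregated to a pose-level score. Not MRP-Net's measure.
import torch

def head_disagreement(pose_a, pose_b):
    """pose_*: (B, J, 3) 3D joints from the two output heads."""
    joint_unc = (pose_a - pose_b).norm(dim=-1)   # (B, J) joint-level
    pose_unc = joint_unc.mean(dim=-1)            # (B,)  pose-level
    return joint_unc, pose_unc
```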
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Consistent 3D Hand Reconstruction in Video via Self-supervised Learning [67.55449194046996]
We present a method for reconstructing accurate and consistent 3D hands from a monocular video.
Detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand.
We propose $\rm S^{2}HAND$, a self-supervised 3D hand reconstruction model.
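
The two cues mentioned above suggest self-supervised losses of roughly the following form, sketched under assumed interfaces: the weak-perspective `project` function is a placeholder for a camera model, and the rendered image is assumed to come from a differentiable renderer.

```python
# Sketch of the two self-supervision cues: keypoint re-projection agreement
# and photometric agreement of the rendered textured hand with the image.
import torch
import torch.nn.functional as F

def project(joints_3d, scale, trans):
    """Assumed weak-perspective camera: (B, J, 3) -> (B, J, 2)."""
    return joints_3d[..., :2] * scale[:, None, None] + trans[:, None, :]

def self_supervised_losses(joints_3d, scale, trans, kpts_2d,
                           rendered, image, mask):
    # Re-projected 3D joints should agree with detected 2D keypoints.
    kpt_loss = F.l1_loss(project(joints_3d, scale, trans), kpts_2d)
    # Rendered hand should photometrically match the input inside the mask.
    photo_loss = (torch.abs(rendered - image) * mask).sum() / mask.sum().clamp(min=1)
    return kpt_loss, photo_loss
```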
arXiv Detail & Related papers (2022-01-24T09:44:11Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
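
The final selection stage could be summarized as the small routine below, where the Consistency Estimation Network is represented by a stand-in scoring callable; this is a sketch of the described behavior, not the paper's code.

```python
# Assumed readout of MION's third stage: score each refined candidate
# against the image evidence and keep the best one. `cen` is a placeholder.
import torch

def select_best(candidates, image_feat, cen):
    """candidates: list of mesh-parameter tensors; cen returns a scalar score."""
    scores = torch.stack([cen(image_feat, c) for c in candidates])
    return candidates[int(scores.argmax())]
```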
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- HandFoldingNet: A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton [4.1954750695245835]
This paper proposes HandFoldingNet, an accurate and efficient hand pose estimator.
The proposed model utilizes a folding-based decoder that folds a given 2D hand skeleton into the corresponding joint coordinates.
Experimental results show that the proposed model outperforms the existing methods on three hand pose benchmark datasets.
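
A minimal sketch of the folding-based decoder idea, with assumed dimensions: an MLP maps each joint of a fixed 2D hand-skeleton template, conditioned on a global feature vector, to a 3D coordinate.

```python
# Sketch of a folding decoder: fold a 2D skeleton template into 3D joints.
# Joint count, feature size, and hidden width are illustrative assumptions.
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                  # per-joint 3D coordinate

    def forward(self, skeleton_2d, global_feat):
        # skeleton_2d: (B, J, 2); global_feat: (B, F) broadcast to each joint
        f = global_feat[:, None, :].expand(-1, skeleton_2d.size(1), -1)
        return self.mlp(torch.cat([skeleton_2d, f], dim=-1))   # (B, J, 3)
```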
arXiv Detail & Related papers (2021-08-12T05:52:44Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D meshes of multiple body parts with large scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
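
One plausible reading of the depth-to-scale projection, written as an assumption since the summary does not give the exact formula: each joint's projection scale is derived from its depth offset relative to the root, so nearer joints project larger.

```python
# Assumed form of a D2S-style projection: per-joint scale from per-joint
# depth, rather than one global scale for the whole body part.
import torch

def d2s_project(joints_3d, focal, root_depth):
    """joints_3d: (B, J, 3) root-relative; returns (B, J, 2) pixel offsets."""
    depth = root_depth[:, None] + joints_3d[..., 2]   # per-joint depth
    scale = focal / depth.clamp(min=1e-6)             # per-joint scale variant
    return joints_3d[..., :2] * scale[..., None]
```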
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Monocular, One-stage, Regression of Multiple 3D People [105.3143785498094]
We propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP).
Our method simultaneously predicts a Body Center heatmap and a Mesh map, which jointly describe the 3D body mesh at the pixel level.
Compared with state-of-the-art methods, ROMP achieves superior performance on challenging multi-person benchmarks.
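
The per-pixel readout described above can be sketched as follows: find local maxima in the Body Center heatmap and sample the Mesh map at those pixels, one parameter vector per detected person. The threshold and pooling kernel size are assumptions.

```python
# Sketch of a center-heatmap readout: peaks in the center map index into the
# per-pixel mesh-parameter map. Threshold/kernel values are illustrative.
import torch
import torch.nn.functional as F

def sample_people(center_map, mesh_map, thresh=0.3):
    """center_map: (1, H, W); mesh_map: (D, H, W) of per-pixel mesh params."""
    peaks = (center_map == F.max_pool2d(center_map, 3, stride=1, padding=1))
    ys, xs = torch.nonzero(peaks[0] & (center_map[0] > thresh), as_tuple=True)
    return mesh_map[:, ys, xs].t()                 # (num_people, D)
```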
arXiv Detail & Related papers (2020-08-27T17:21:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.