Multi-initialization Optimization Network for Accurate 3D Human Pose and
Shape Estimation
- URL: http://arxiv.org/abs/2112.12917v1
- Date: Fri, 24 Dec 2021 02:43:58 GMT
- Title: Multi-initialization Optimization Network for Accurate 3D Human Pose and
Shape Estimation
- Authors: Zhiwei Liu, Xiangyu Zhu, Lu Yang, Xiang Yan, Ming Tang, Zhen Lei,
Guibo Zhu, Xuetao Feng, Yan Wang, Jinqiao Wang
- Abstract summary: We propose a three-stage framework named Multi-Initialization Optimization Network (MION).
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result individually via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
- Score: 75.44912541912252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D human pose and shape recovery from a monocular RGB image is a challenging
task. Existing learning-based methods depend heavily on weak supervision
signals, e.g. 2D and 3D joint locations, due to the lack of in-the-wild paired
3D supervision. However, given the 2D-to-3D ambiguities inherent in these
weak supervision labels, a network trained with such labels easily gets stuck
in local optima. In this paper, we reduce the ambiguity by optimizing
multiple initializations. Specifically, we propose a three-stage framework
named Multi-Initialization Optimization Network (MION). In the first stage, we
strategically select different coarse 3D reconstruction candidates that are
compatible with the 2D keypoints of the input sample. Each coarse reconstruction
can be regarded as an initialization that leads to one optimization branch. In the
second stage, we design a mesh refinement transformer (MRT) to refine each
coarse reconstruction result individually via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best
result from multiple candidates by evaluating whether the visual evidence in the
RGB image matches a given 3D reconstruction. Experiments demonstrate that our
Multi-Initialization Optimization Network outperforms existing 3D mesh based
methods on multiple public benchmarks.
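The select-refine-score pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the candidate sampler, the refinement step, and the scoring function below are hypothetical stand-ins for the paper's learned modules (the depth-lifting heuristic, the MRT, and the CEN are not the authors' implementations).

```python
import numpy as np

rng = np.random.default_rng(0)

def select_candidates(keypoints_2d, n_candidates=5):
    # Stage 1 (hypothetical stand-in): generate coarse 3D candidates that all
    # agree with the 2D keypoints by pairing them with different depth guesses.
    n_joints = keypoints_2d.shape[0]
    depths = rng.uniform(1.0, 3.0, size=(n_candidates, n_joints, 1))
    xy = np.repeat(keypoints_2d[None], n_candidates, axis=0)
    return np.concatenate([xy, depths], axis=-1)  # (n_candidates, n_joints, 3)

def refine(candidate, image_feature):
    # Stage 2 (placeholder for the mesh refinement transformer, MRT):
    # each candidate is refined independently against the image evidence.
    return candidate + 0.1 * (image_feature - candidate)

def consistency_score(candidate, image_feature):
    # Stage 3 (placeholder for the Consistency Estimation Network, CEN):
    # higher score means better agreement with the visual evidence.
    return -float(np.linalg.norm(candidate - image_feature))

keypoints_2d = rng.uniform(0.0, 1.0, size=(17, 2))   # 17 detected 2D joints
image_feature = rng.uniform(0.0, 1.0, size=(17, 3))  # stand-in visual evidence

candidates = select_candidates(keypoints_2d)
refined = [refine(c, image_feature) for c in candidates]
scores = [consistency_score(r, image_feature) for r in refined]
best = refined[int(np.argmax(scores))]               # final 3D estimate
print(best.shape)  # (17, 3)
```

The point of the structure is that each candidate is an independent optimization branch: refinement never mixes candidates, and only the final consistency check selects among them, which is what lets the method escape the local optima of a single initialization.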
Related papers
- Sampling is Matter: Point-guided 3D Human Mesh Reconstruction [0.0]
This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image.
Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction.
arXiv Detail & Related papers (2023-04-19T08:45:26Z) - CheckerPose: Progressive Dense Keypoint Localization for Object Pose
Estimation with Graph Neural Network [66.24726878647543]
Estimating the 6-DoF pose of a rigid object from a single RGB image is a crucial yet challenging task.
Recent studies have shown the great potential of dense correspondence-based solutions.
We propose a novel pose estimation algorithm named CheckerPose, which improves on three main aspects.
arXiv Detail & Related papers (2023-03-29T17:30:53Z) - End-to-end Weakly-supervised Single-stage Multiple 3D Hand Mesh
Reconstruction from a Single RGB Image [9.238322841389994]
We propose a single-stage pipeline for multi-hand reconstruction.
Specifically, we design a multi-head auto-encoder structure, where each head network shares the same feature map and outputs the hand center, pose and texture.
Our method outperforms the state-of-the-art model-based methods in both weakly-supervised and fully-supervised manners.
arXiv Detail & Related papers (2022-04-18T03:57:14Z) - Permutation-Invariant Relational Network for Multi-person 3D Pose
Estimation [46.38290735670527]
Recovering multi-person 3D poses from a single RGB image is a severely ill-conditioned problem.
Recent works have shown promising results by simultaneously reasoning for different people but in all cases within a local neighborhood.
PI-Net introduces a self-attention block to reason for all people in the image at the same time and refine potentially noisy initial 3D poses.
In this paper, we model interactions among people as a whole, independently of their number, and in a permutation-invariant manner, building upon the Set Transformer.
arXiv Detail & Related papers (2022-04-11T07:23:54Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z) - Soft Expectation and Deep Maximization for Image Feature Detection [68.8204255655161]
We propose SEDM, an iterative semi-supervised learning process that flips the question and first looks for repeatable 3D points, then trains a detector to localize them in image space.
Our results show that this new model trained using SEDM is able to better localize the underlying 3D points in a scene.
arXiv Detail & Related papers (2021-04-21T00:35:32Z) - Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z) - Weakly Supervised Generative Network for Multiple 3D Human Pose
Hypotheses [74.48263583706712]
3D human pose estimation from a single image is an inverse problem due to the inherent ambiguity of the missing depth.
We propose a weakly supervised deep generative network to address the inverse problem.
arXiv Detail & Related papers (2020-08-13T09:26:01Z) - Implicit Functions in Feature Space for 3D Shape Reconstruction and
Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction in ShapeNet, and obtain significantly more accurate 3D human reconstructions.
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.