Monocular, One-stage, Regression of Multiple 3D People
- URL: http://arxiv.org/abs/2008.12272v4
- Date: Thu, 16 Sep 2021 11:41:15 GMT
- Title: Monocular, One-stage, Regression of Multiple 3D People
- Authors: Yu Sun, Qian Bao, Wu Liu, Yili Fu, Michael J. Black, Tao Mei
- Abstract summary: We propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP).
Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel level.
Compared with state-of-the-art methods, ROMP achieves superior performance on the challenging multi-person benchmarks.
- Score: 105.3143785498094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on the regression of multiple 3D people from a single RGB
image. Existing approaches predominantly follow a multi-stage pipeline that
first detects people in bounding boxes and then independently regresses their
3D body meshes. In contrast, we propose to Regress all meshes in a One-stage
fashion for Multiple 3D People (termed ROMP). The approach is conceptually
simple, bounding box-free, and able to learn a per-pixel representation in an
end-to-end manner. Our method simultaneously predicts a Body Center heatmap and
a Mesh Parameter map, which can jointly describe the 3D body mesh on the pixel
level. Through a body-center-guided sampling process, the body mesh parameters
of all people in the image are easily extracted from the Mesh Parameter map.
Equipped with such a fine-grained representation, our one-stage framework is
free of the complex multi-stage process and more robust to occlusion. Compared
with state-of-the-art methods, ROMP achieves superior performance on the
challenging multi-person benchmarks, including 3DPW and CMU Panoptic.
Experiments on crowded/occluded datasets demonstrate the robustness under
various types of occlusion. The released code is the first real-time
implementation of monocular multi-person 3D mesh regression.
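The body-center-guided sampling the abstract describes reduces to reading the Mesh Parameter map at the peaks of the Body Center heatmap. Below is a minimal PyTorch sketch of that parsing step; it is not the released ROMP code, and the function name, the 3x3 max-pool peak suppression, the confidence threshold, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def extract_meshes(center_heatmap, param_map, conf_thresh=0.25, max_people=64):
    """Sample per-person mesh parameters at body-center peaks (sketch).

    center_heatmap: (1, H, W) confidence that a pixel is a body center.
    param_map:      (C, H, W) per-pixel parameter vectors (e.g. camera +
                    SMPL pose/shape; C is model-specific, assumed here).
    Returns an (N, C) tensor with one parameter vector per detected person.
    """
    heat = center_heatmap.squeeze(0)                        # (H, W)
    # Keep only local maxima: a pixel survives if it equals the max of
    # its 3x3 neighborhood (a common heatmap peak-suppression trick).
    pooled = F.max_pool2d(heat[None, None], 3, stride=1, padding=1)[0, 0]
    peaks = (heat == pooled) & (heat > conf_thresh)
    ys, xs = torch.nonzero(peaks, as_tuple=True)
    # Rank surviving peaks by confidence and cap the detection count.
    order = heat[ys, xs].argsort(descending=True)[:max_people]
    ys, xs = ys[order], xs[order]
    # Read the parameter map at each center: one vector per person.
    return param_map[:, ys, xs].T                           # (N, C)
```

Because both maps come from a single forward pass of one backbone, a step like this is all that stands between the network output and the per-person meshes, which is what makes the approach bounding-box-free and fast enough for real time.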
Related papers
- MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z)
- Sampling is Matter: Point-guided 3D Human Mesh Reconstruction [0.0]
This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image.
Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction.
arXiv Detail & Related papers (2023-04-19T08:45:26Z)
- Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement [5.655207244072081]
Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging.
We propose a coarse-to-fine pipeline that benefits from inverse kinematics on occlusion-robust 3D skeleton estimates, followed by mesh refinement.
We demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS, and AGORA datasets.
arXiv Detail & Related papers (2022-10-24T18:29:06Z)
- Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation [46.38290735670527]
Recovering multi-person 3D poses from a single RGB image is a severely ill-conditioned problem.
Recent works have shown promising results by reasoning about multiple people simultaneously, but in all cases only within a local neighborhood.
PI-Net introduces a self-attention block to reason about all people in the image at the same time and refine potentially noisy initial 3D poses.
In this paper, we model people interactions as a whole, independently of their number, and in a permutation-invariant manner, building upon the Set Transformer.
arXiv Detail & Related papers (2022-04-11T07:23:54Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION); see the control-flow sketch after this list.
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result among multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild [98.69191256693703]
We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras separated by wide baselines.
It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment.
It outperforms the state-of-the-art methods by a large margin on three public datasets including Shelf, Campus and CMU Panoptic.
arXiv Detail & Related papers (2021-08-05T08:35:44Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
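As referenced in the MION entry above, the select-refine-score control flow of a multi-initialization pipeline is easy to sketch. The PyTorch sketch below is a hypothetical stand-in, not MION's implementation: the parameter dimension, the MLPs standing in for the mesh refinement transformer and the Consistency Estimation Network, and the candidate count are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiInitPipeline(nn.Module):
    """Select-refine-score flow in the spirit of MION's three stages.
    All module internals are placeholders, not the paper's networks."""

    def __init__(self, param_dim=85, feat_dim=256):
        super().__init__()
        # Stage 2 stand-in: refines a coarse parameter vector given
        # image features (the paper uses a mesh refinement transformer).
        self.refiner = nn.Sequential(
            nn.Linear(param_dim + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, param_dim))
        # Stage 3 stand-in: scores how well a candidate matches the
        # visual evidence (the paper's Consistency Estimation Network).
        self.scorer = nn.Sequential(
            nn.Linear(param_dim + feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, img_feat, candidates):
        # img_feat:   (B, feat_dim) pooled image features.
        # candidates: (B, K, param_dim) coarse reconstructions, assumed
        #             already consistent with the 2D keypoints (stage 1).
        B, K, _ = candidates.shape
        feat = img_feat.unsqueeze(1).expand(B, K, -1)
        x = torch.cat([candidates, feat], dim=-1)
        refined = candidates + self.refiner(x)        # stage 2: refine each
        scores = self.scorer(
            torch.cat([refined, feat], dim=-1)).squeeze(-1)
        best = scores.argmax(dim=1)                   # stage 3: pick best
        return refined[torch.arange(B), best]
```

The design point the sketch captures is that refinement and selection are decoupled: every candidate is refined independently, and only afterwards does a learned scorer arbitrate among them.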
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.