Putting People in their Place: Monocular Regression of 3D People in Depth
- URL: http://arxiv.org/abs/2112.08274v1
- Date: Wed, 15 Dec 2021 17:08:17 GMT
- Authors: Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, Michael J. Black
- Abstract summary: Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth.
We develop a novel method to infer the poses and depth of multiple people in a single image.
We exploit a 3D body model space that lets BEV infer shapes from infants to adults.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an image with multiple people, our goal is to directly regress the pose
and shape of all the people as well as their relative depth. Inferring the
depth of a person in an image, however, is fundamentally ambiguous without
knowing their height. This is particularly problematic when the scene contains
people of very different sizes, e.g. from infants to adults. To solve this, we
need several things. First, we develop a novel method to infer the poses and
depth of multiple people in a single image. While previous work that estimates
multiple people does so by reasoning in the image plane, our method, called
BEV, adds an additional imaginary Bird's-Eye-View representation to explicitly
reason about depth. BEV reasons simultaneously about body centers in the image
and in depth and, by combining these, estimates 3D body position. Unlike prior
work, BEV is a single-shot method that is end-to-end differentiable. Second,
height varies with age, making it impossible to resolve depth without also
estimating the age of people in the image. To do so, we exploit a 3D body model
space that lets BEV infer shapes from infants to adults. Third, to train BEV,
we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset
that includes age labels and relative depth relationships between the people in
the images. Extensive experiments on RH and AGORA demonstrate the effectiveness
of the model and training scheme. BEV outperforms existing methods on depth
reasoning, child shape estimation, and robustness to occlusion. The code and
dataset will be released for research purposes.
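The combination of an image-plane body center with an inferred depth can be illustrated with standard pinhole back-projection. The sketch below is hypothetical and is not the authors' implementation; the intrinsics `f`, `cx`, `cy` and the function name are assumptions for illustration.

```python
import numpy as np

def backproject_center(u, v, z, f, cx, cy):
    """Lift a 2D body center (u, v) with depth z to a 3D camera-frame
    point via the pinhole model: X = (u - cx) * z / f, Y = (v - cy) * z / f."""
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.array([x, y, z])

# A body center at the principal point lies on the optical axis.
p = backproject_center(320.0, 240.0, 5.0, f=500.0, cx=320.0, cy=240.0)
# A center offset 100 px horizontally at depth 5 m with f = 500 px
# maps to 1 m of lateral offset.
q = backproject_center(420.0, 240.0, 5.0, f=500.0, cx=320.0, cy=240.0)
```

This also makes the ambiguity in the abstract concrete: a shorter person closer to the camera and a taller person farther away can project to the same (u, v), which is why depth cannot be resolved without estimating body size.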
Related papers
- Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh
Reconstruction [66.10717041384625]
Zolly is the first 3DHMR method focusing on perspective-distorted images.
We propose a new camera model and a novel 2D representation, termed distortion image, which describes the 2D dense distortion scale of the human body.
We extend two real-world datasets tailored for this task, all containing perspective-distorted human images.
arXiv Detail & Related papers (2023-03-24T04:22:41Z)
- Learning to Estimate 3D Human Pose from Point Cloud [13.27496851711973]
We propose a deep human pose network for 3D pose estimation by taking the point cloud data as input data to model the surface of complex human structures.
Our experiments on two public datasets show that our approach achieves higher accuracy than previous state-of-the-art methods.
arXiv Detail & Related papers (2022-12-25T14:22:01Z)
- Scene-aware Egocentric 3D Human Pose Estimation [72.57527706631964]
Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality.
Existing methods still struggle in challenging poses where the human body is highly occluded or is closely interacting with the scene.
We propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints.
arXiv Detail & Related papers (2022-12-20T21:35:39Z)
- Towards Accurate Reconstruction of 3D Scene Shape from a Single Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
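A depth map predicted "up to an unknown scale and shift" can be aligned to metric depth by a least-squares fit of an affine transform, a standard step when evaluating scale-and-shift-ambiguous predictions. The sketch below is an assumption-laden illustration of that alignment, not the paper's code.

```python
import numpy as np

def align_scale_shift(d_pred, d_gt):
    """Least-squares fit of scale s and shift t so that s * d_pred + t ≈ d_gt.
    Returns (s, t) and the aligned depth values."""
    A = np.stack([d_pred.ravel(), np.ones(d_pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_gt.ravel(), rcond=None)
    return s, t, s * d_pred + t

# Synthetic example: the "ground truth" is an affine transform of the prediction.
d_pred = np.array([1.0, 2.0, 3.0, 4.0])
d_gt = 2.0 * d_pred + 0.5
s, t, aligned = align_scale_shift(d_pred, d_gt)
```

The second stage described above then has to estimate the remaining unknowns (depth shift and focal length) from learned priors rather than from ground truth, since no metric reference is available at test time.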
arXiv Detail & Related papers (2022-08-28T16:20:14Z)
- Dual networks based 3D Multi-Person Pose Estimation from Monocular Video [42.01876518017639]
Multi-person 3D pose estimation is more challenging than single pose estimation.
Existing top-down and bottom-up approaches to pose estimation suffer from detection errors.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
arXiv Detail & Related papers (2022-05-02T08:53:38Z)
- Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images [44.96633481495911]
We address the problem of multi-person 3D body pose and shape estimation from a single image.
We devise a novel optimization scheme that learns the appropriate body scale and relative camera pose, by enforcing the feet of all people to remain on the ground floor.
A thorough evaluation on MuPoTS-3D and 3DPW datasets demonstrates that our approach is able to robustly estimate the body translation and shape of multiple people while retrieving their spatial arrangement.
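The ground-floor constraint can be sketched as follows: under a pinhole camera, scaling a person's body size (and hence their translation along the viewing ray) by a factor s moves their camera-frame foot point from p to s·p, so s can be chosen to place the feet exactly on the ground plane. This is a hypothetical illustration of the idea, not the paper's optimization scheme; it assumes a camera frame where y increases downward toward the floor.

```python
def scale_to_ground(foot_cam, y_ground):
    """Pick the per-person scale s that puts the foot point on the
    ground plane y = y_ground, then return the rescaled foot point."""
    x, y, z = foot_cam
    s = y_ground / y          # s * y == y_ground after rescaling
    return s, (s * x, s * y, s * z)

# A unit-scale foot point at y = 0.8 reaches the floor at y = 1.6
# when the person (and their depth) is doubled.
s, foot = scale_to_ground((0.1, 0.8, 2.0), y_ground=1.6)
```

Enforcing this jointly for all people couples their scales through the shared ground plane, which is what resolves the size-versus-depth ambiguity.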
arXiv Detail & Related papers (2021-11-02T20:42:41Z)
- Learning Realistic Human Reposing using Cyclic Self-Supervision with 3D Shape, Pose, and Appearance Consistency [55.94908688207493]
We propose a self-supervised framework named SPICE that closes the image quality gap with supervised methods.
The key insight enabling self-supervision is to exploit 3D information about the human body in several ways.
SPICE achieves state-of-the-art performance on the DeepFashion dataset.
arXiv Detail & Related papers (2021-10-11T17:48:50Z)
- AGORA: Avatars in Geography Optimized for Regression Analysis [35.22486186509372]
AGORA is a synthetic dataset with high realism and highly accurate ground truth.
We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans.
We evaluate existing state-of-the-art methods for 3D human pose estimation on this dataset and find that most methods perform poorly on images of children.
arXiv Detail & Related papers (2021-04-29T20:33:25Z)
- Coherent Reconstruction of Multiple Humans from a Single Image [68.3319089392548]
In this work, we address the problem of multi-person 3D pose estimation from a single image.
A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently.
Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene.
arXiv Detail & Related papers (2020-06-15T17:51:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.