SPEC: Seeing People in the Wild with an Estimated Camera
- URL: http://arxiv.org/abs/2110.00620v1
- Date: Fri, 1 Oct 2021 19:05:18 GMT
- Title: SPEC: Seeing People in the Wild with an Estimated Camera
- Authors: Muhammed Kocabas, Chun-Hao P. Huang, Joachim Tesch, Lea Müller,
Otmar Hilliges, and Michael J. Black
- Abstract summary: We introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image.
We train a neural network to estimate the field of view, camera pitch, and roll given an input image.
We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose.
- Score: 64.85791231401684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the lack of camera parameter information for in-the-wild images,
existing 3D human pose and shape (HPS) estimation methods make several
simplifying assumptions: weak-perspective projection, large constant focal
length, and zero camera rotation. These assumptions often do not hold and we
show, quantitatively and qualitatively, that they cause errors in the
reconstructed 3D shape and pose. To address this, we introduce SPEC, the first
in-the-wild 3D HPS method that estimates the perspective camera from a single
image and employs this to reconstruct 3D human bodies more accurately. First,
we train a neural network to estimate the field of
view, camera pitch, and roll given an input image. We employ novel losses that
improve the calibration accuracy over previous work. We then train a novel
network that concatenates the camera calibration to the image features and uses
these together to regress 3D body shape and pose. SPEC is more accurate than
the prior art on the standard benchmark (3DPW) as well as two new datasets with
more challenging camera views and varying focal lengths. Specifically, we
create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D
bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and
high-quality reference bodies. Both qualitative and quantitative analysis
confirm that knowing camera parameters during inference regresses better human
bodies. Code and datasets are available for research purposes at
https://spec.is.tue.mpg.de.
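The abstract describes a camera in terms of field of view, pitch, and roll. As a minimal sketch of how such parameters can define a perspective projection, the following turns a vertical field of view into a focal length in pixels, builds a rotation from pitch and roll, and projects a 3D point. The function names, the axis conventions (pitch about x, roll about z, principal point at the image center), and the sample values are illustrative assumptions, not SPEC's actual code.

```python
import math


def focal_length_px(vfov_rad, image_height_px):
    """Focal length in pixels from a vertical field of view (pinhole model)."""
    return 0.5 * image_height_px / math.tan(0.5 * vfov_rad)


def camera_rotation(pitch_rad, roll_rad):
    """3x3 rotation as nested lists: roll about z applied after pitch about x.

    The axis convention is an assumption made for this sketch.
    """
    cp, sp = math.cos(pitch_rad), math.sin(pitch_rad)
    cr, sr = math.cos(roll_rad), math.sin(roll_rad)
    r_pitch = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    r_roll = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return [
        [sum(r_roll[i][k] * r_pitch[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def project(point, rotation, translation, focal_px, width_px, height_px):
    """Perspective projection of a 3D point to pixel coordinates."""
    x, y, z = (
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Principal point assumed at the image center.
    return (focal_px * x / z + width_px / 2, focal_px * y / z + height_px / 2)


# Hypothetical values: 60 degree vertical FOV, slight downward pitch, no roll.
f = focal_length_px(math.radians(60.0), 1080)
R = camera_rotation(math.radians(10.0), 0.0)
u, v = project((0.2, -0.1, 0.0), R, (0.0, 0.0, 4.0), f, 1920, 1080)
```

A weak-perspective model would instead divide by one shared depth for all points, which is one of the simplifying assumptions the abstract argues against.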
Related papers
- SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images.
SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views.
It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z)
- Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot [22.848563931757962]
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image.
Predictions encompass the whole body, including hands and facial expressions, using the SMPL-X parametric model.
We show that incorporating it into the training data further enhances predictions, particularly for hands.
arXiv Detail & Related papers (2024-02-22T16:05:13Z)
- Unsupervised Multi-Person 3D Human Pose Estimation From 2D Poses Alone [4.648549457266638]
We present one of the first studies investigating the feasibility of unsupervised multi-person 2D-3D pose estimation.
Our method involves independently lifting each subject's 2D pose to 3D, before combining them in a shared 3D coordinate system.
This by itself enables us to retrieve an accurate 3D reconstruction of their poses.
arXiv Detail & Related papers (2023-09-26T11:42:56Z)
- Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction [66.10717041384625]
Zolly is the first 3DHMR method focusing on perspective-distorted images.
We propose a new camera model and a novel 2D representation, termed distortion image, which describes the 2D dense distortion scale of the human body.
We extend two real-world datasets tailored for this task, all containing perspective-distorted human images.
arXiv Detail & Related papers (2023-03-24T04:22:41Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- Learning Temporal 3D Human Pose Estimation with Pseudo-Labels [3.0954251281114513]
We present a simple, yet effective, approach for self-supervised 3D human pose estimation.
We rely on triangulating 2D body pose estimates of a multiple-view camera system.
Our method achieves state-of-the-art performance in the Human3.6M and MPI-INF-3DHP benchmarks.
arXiv Detail & Related papers (2021-10-14T17:40:45Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.