Human Performance Capture from Monocular Video in the Wild
- URL: http://arxiv.org/abs/2111.14672v2
- Date: Tue, 30 Nov 2021 16:03:36 GMT
- Title: Human Performance Capture from Monocular Video in the Wild
- Authors: Chen Guo, Xu Chen, Jie Song and Otmar Hilliges
- Abstract summary: We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on the in-the-wild human video dataset 3DPW.
- Score: 50.34917313325813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Capturing the dynamically deforming 3D shape of clothed humans is essential
for numerous applications, including VR/AR, autonomous driving, and
human-computer interaction. Existing methods either require a highly
specialized capturing setup, such as expensive multi-view imaging systems, or
they lack robustness to challenging body poses. In this work, we propose a
method capable of capturing the dynamic 3D human shape from a monocular video
featuring challenging body poses, without any additional input. We first build
a 3D template human model of the subject based on a learned regression model.
We then track this template model's deformation under challenging body
articulations based on 2D image observations. Our method outperforms
state-of-the-art methods on the in-the-wild human video dataset 3DPW. Moreover,
we demonstrate its robustness and generalizability on videos from the iPER
dataset.
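The abstract does not give implementation details, but the two-stage idea (build a person-specific template, then track its deformation against 2D observations) can be illustrated with a minimal, hypothetical optimization sketch. Everything below is an assumption: a generic vertex-offset deformation, a weak-perspective camera, and a 2D keypoint loss stand in for the paper's learned regression model and full energy terms.

```python
# Minimal sketch of template-then-track fitting (not the paper's code).
import torch

def project(points3d, scale, trans2d):
    """Weak-perspective projection of (V, 3) points to (V, 2) pixels."""
    return scale * points3d[:, :2] + trans2d

def track_frame(template_verts, joint_regressor, keypoints2d,
                steps=200, lr=1e-2):
    """Fit per-vertex offsets so regressed 3D joints match 2D detections."""
    offsets = torch.zeros_like(template_verts, requires_grad=True)
    scale = torch.ones(1, requires_grad=True)
    trans = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([offsets, scale, trans], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts = template_verts + offsets
        joints3d = joint_regressor @ verts              # (J, 3) joints
        loss = ((project(joints3d, scale, trans) - keypoints2d) ** 2).mean()
        loss = loss + 1e-3 * (offsets ** 2).mean()      # keep deformation small
        loss.backward()
        opt.step()
    return template_verts + offsets.detach()
```

In a real system the keypoint term would be joined by silhouette and pose-prior terms, and the template would come from the learned regression model rather than a neutral mesh.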
Related papers
- BAG: Body-Aligned 3D Wearable Asset Generation [59.7545477546307]
BAG is a body-aligned asset generation method that outputs 3D wearable assets which can be automatically fitted onto given 3D human bodies.
Results demonstrate significant advantages over existing methods in terms of image prompt-following capability, shape diversity, and shape quality.
arXiv Detail & Related papers (2025-01-27T16:23:45Z)
- DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models [9.103840202072336]
We present a method to learn the 3D dynamic affordance from synthetically generated 2D videos.
Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples.
arXiv Detail & Related papers (2025-01-14T18:59:59Z)
- DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses [57.17501809717155]
We present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs.
Our key insight is that human images naturally exhibit multiple levels of correlation.
We construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations.
arXiv Detail & Related papers (2024-11-30T08:42:13Z)
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
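As a rough illustration of what layer-wise compositing can look like (the field interface, sampling scheme, and blending rule below are assumptions, not MultiPly's actual implementation), densities from several per-person fields plus a background field can be summed at each ray sample, with colors blended by each layer's share of the total density:

```python
# Hedged sketch of layer-wise volume rendering over multiple human layers.
import torch

def composite_layers(fields, samples, deltas):
    """fields: callables mapping (N, 3) points -> (sigma (N,), rgb (N, 3)).
    samples: (N, 3) points along one ray; deltas: (N,) segment lengths."""
    sigmas, rgbs = zip(*[f(samples) for f in fields])
    sigma_total = torch.stack(sigmas).sum(dim=0)                  # (N,)
    weights = torch.stack(sigmas) / sigma_total.clamp(min=1e-8)   # (L, N)
    rgb_mix = (weights.unsqueeze(-1) * torch.stack(rgbs)).sum(dim=0)
    alpha = 1.0 - torch.exp(-sigma_total * deltas)                # (N,)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    return ((trans * alpha).unsqueeze(-1) * rgb_mix).sum(dim=0)   # pixel (3,)
```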
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
- Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance [25.346255905155424]
We introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework.
By using the 3D human parametric model as motion guidance, we can perform parametric shape alignment of the human body between the reference image and the source video motion.
Our approach also exhibits superior generalization capabilities on the proposed in-the-wild dataset.
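Read literally, shape alignment here means pairing the driving video's per-frame pose with the reference subject's shape before rendering the guidance maps. A toy version of that pairing (assuming an SMPL-style pose/shape split, which the abstract implies but does not spell out):

```python
# Toy illustration of parametric shape alignment (assumed SMPL-style split of
# pose vs. shape parameters; not Champ's actual code): each driving-frame pose
# is paired with the reference image's shape before rendering guidance maps.
from typing import List, Sequence, Tuple

def align_shape_to_reference(
    ref_shape: Sequence[float],
    driving_poses: List[Sequence[float]],
) -> List[Tuple[Sequence[float], Sequence[float]]]:
    """Pair each driving-frame pose with the reference subject's shape."""
    return [(pose, ref_shape) for pose in driving_poses]
```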
arXiv Detail & Related papers (2024-03-21T18:52:58Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
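The paper's depth-and-scale recovery is more involved, but the underlying geometric intuition is the pinhole relation: a person of metric height H spanning h pixels under focal length f sits at depth z ≈ f·H / h. A back-of-the-envelope version, with all numbers hypothetical:

```python
# Back-of-the-envelope depth from apparent size (pinhole geometry only; the
# paper combines disparity maps, 2D joints, and joint angles, which this
# sketch does not attempt). All numbers below are hypothetical.
def person_depth(focal_px: float, height_m: float, height_px: float) -> float:
    """Approximate camera-to-person distance: z ~= f * H / h."""
    return focal_px * height_m / height_px

# A 1.75 m person spanning 350 px under a 1400 px focal length is ~7 m away.
print(person_depth(focal_px=1400.0, height_m=1.75, height_px=350.0))  # 7.0
```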
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- HiFECap: Monocular High-Fidelity and Expressive Capture of Human Performances [84.7225785061814]
HiFECap simultaneously captures human pose, clothing, facial expression, and hands from just a single RGB video.
Our method also captures high-frequency details, such as deforming wrinkles on the clothes, better than previous works.
arXiv Detail & Related papers (2022-10-11T17:57:45Z)
- Self-Supervised 3D Human Pose Estimation in Static Video Via Neural Rendering [5.568218439349004]
Inferring 3D human pose from 2D images is a challenging and long-standing problem in the field of computer vision.
We present preliminary results for a method to estimate 3D pose from 2D video containing a single person.
arXiv Detail & Related papers (2022-10-10T09:24:07Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
- Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows [43.89097619822221]
We present semi-supervised and self-supervised models that support training and good generalization in real-world images and video.
Our formulation is based on kinematic latent normalizing flow representations and dynamics, as well as differentiable, semantic body part alignment loss functions.
In extensive experiments using 3D motion capture datasets like CMU, Human3.6M, 3DPW, or AMASS, we show that the proposed methods outperform the state of the art.
arXiv Detail & Related papers (2020-03-23T16:11:51Z)
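To make the flow-based prior concrete: a normalizing flow maps pose parameters to a latent with a known density, and the negative log-likelihood under the flow can serve as a prior term alongside a reconstruction loss. The single affine layer below is a deliberately minimal stand-in, an assumption for illustration rather than the paper's architecture:

```python
# Minimal stand-in for a normalizing-flow pose prior (an illustrative
# assumption, not the paper's architecture): an invertible affine map sends
# pose theta to latent z; -log p(theta) = 0.5*||z||^2 - log|det dz/dtheta|
# (up to a constant) can be added to a reconstruction loss as a prior term.
import torch
import torch.nn as nn

class AffineFlowPrior(nn.Module):
    """Single learned affine layer in place of a deep flow."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def nll(self, theta: torch.Tensor) -> torch.Tensor:
        z = (theta - self.shift) * torch.exp(-self.log_scale)
        log_det = -self.log_scale.sum()    # log|det dz/dtheta|
        return 0.5 * (z ** 2).sum(dim=-1) - log_det

prior = AffineFlowPrior(dim=72)            # e.g. axis-angle body pose
pose_loss = prior.nll(torch.randn(8, 72)).mean()  # add to reconstruction loss
```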