HiFECap: Monocular High-Fidelity and Expressive Capture of Human
Performances
- URL: http://arxiv.org/abs/2210.05665v1
- Date: Tue, 11 Oct 2022 17:57:45 GMT
- Title: HiFECap: Monocular High-Fidelity and Expressive Capture of Human
Performances
- Authors: Yue Jiang, Marc Habermann, Vladislav Golyanik, Christian Theobalt
- Abstract summary: HiFECap simultaneously captures human pose, clothing, facial expression, and hands just from a single RGB video.
Our method also captures high-frequency details, such as deforming wrinkles on the clothes, better than the previous works.
- Score: 84.7225785061814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D human performance capture is indispensable for many applications
in computer graphics and vision for enabling immersive experiences. However,
detailed capture of humans requires tracking of multiple aspects, including the
skeletal pose, the dynamic surface, which includes clothing, hand gestures as
well as facial expressions. No existing monocular method allows joint tracking
of all these components. To this end, we propose HiFECap, a new neural human
performance capture approach, which simultaneously captures human pose,
clothing, facial expression, and hands just from a single RGB video. We
demonstrate that our proposed network architecture, the carefully designed
training strategy, and the tight integration of parametric face and hand models
to a template mesh enable the capture of all these individual aspects.
Importantly, our method also captures high-frequency details, such as deforming
wrinkles on the clothes, better than the previous works. Furthermore, we show
that HiFECap outperforms the state-of-the-art human performance capture
approaches qualitatively and quantitatively while for the first time capturing
all aspects of the human.
Related papers
- WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction [51.22641018932625]
We present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis.
Our method achieves SOTA performance in producing photorealistic renderings from the given monocular video.
arXiv Detail & Related papers (2025-02-03T04:43:41Z) - Human Multi-View Synthesis from a Single-View Model:Transferred Body and Face Representations [7.448124739584319]
We propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis.
Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation.
Our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.
arXiv Detail & Related papers (2024-12-04T04:02:17Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z) - CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z) - GHuNeRF: Generalizable Human NeRF from a Monocular Video [63.741714198481354]
GHuNeRF learns a generalizable human NeRF model from a monocular video.
We validate our approach on the widely-used ZJU-MoCap dataset.
arXiv Detail & Related papers (2023-08-31T09:19:06Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.