Synthesizing Photorealistic Virtual Humans Through Cross-modal
Disentanglement
- URL: http://arxiv.org/abs/2209.01320v2
- Date: Fri, 24 Mar 2023 01:39:51 GMT
- Authors: Siddarth Ravichandran, Ondřej Texler, Dimitar Dinev, Hyun Jae Kang
- Abstract summary: We propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion.
Our method runs in real time and delivers superior results compared to the current state of the art.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last few decades, many aspects of human life have been enhanced with
virtual domains, from the advent of digital assistants such as Amazon's Alexa
and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These
trends underscore the importance of generating photorealistic visual depictions
of humans. This has led to the rapid growth of so-called deepfake and
talking-head generation methods in recent years. Despite their impressive
results and popularity, they usually lack certain qualitative aspects such as
texture quality, lip synchronization, or resolution, and practical aspects
such as the ability to run in real time. To allow virtual human avatars to
be used in practical scenarios, we propose an end-to-end framework for
synthesizing high-quality virtual human faces capable of speaking with accurate
lip motion with a special emphasis on performance. We introduce a novel network
utilizing visemes as an intermediate audio representation and a novel data
augmentation strategy employing a hierarchical image synthesis approach that
allows disentanglement of the different modalities used to control the global
head motion. Our method runs in real time and delivers superior results
compared to the current state of the art.
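As background on the viseme representation mentioned in the abstract: a viseme is a visual speech unit that groups phonemes producing the same mouth shape. The sketch below illustrates the general idea with a toy phoneme-to-viseme lookup; the grouping and class names are a common simplification for illustration only, not the paper's actual mapping or network input.

```python
# Toy phoneme-to-viseme lookup. The grouping below is a common
# simplification (e.g. bilabials p/b/m share one mouth shape);
# it is NOT the mapping used in the paper.
PHONEME_TO_VISEME = {
    # bilabial closure
    "p": "PP", "b": "PP", "m": "PP",
    # labiodental
    "f": "FF", "v": "FF",
    # open vowels
    "aa": "AA", "ae": "AA",
    # rounded vowels
    "uw": "OO", "ow": "OO",
}

def phonemes_to_visemes(phonemes):
    """Collapse a phoneme sequence into viseme classes,
    merging consecutive duplicates (repeated mouth shapes)."""
    visemes = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p.lower(), "SIL")  # unknown -> neutral/silence
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

print(phonemes_to_visemes(["p", "aa", "m", "uw"]))  # ['PP', 'AA', 'PP', 'OO']
```

Using such classes as an intermediate audio representation, rather than raw audio features, reduces the space of mouth shapes the synthesis network must learn.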
Related papers
- RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images [24.042262870735087]
We propose a novel post-processing solution named RealisHuman.
First, it generates realistic human parts, such as hands or faces, using the original parts as references.
Second, it seamlessly integrates the rectified human parts back into their corresponding positions.
arXiv Detail & Related papers (2024-09-05T16:02:11Z)
- CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- Self-supervised novel 2D view synthesis of large-scale scenes with efficient multi-scale voxel carving [77.07589573960436]
We introduce an efficient multi-scale voxel carving method to generate novel views of real scenes.
Our final high-resolution output is efficiently self-trained on data automatically generated by the voxel carving module.
We demonstrate the effectiveness of our method on highly complex and large-scale scenes in real environments.
arXiv Detail & Related papers (2023-06-26T13:57:05Z)
- Novel View Synthesis of Humans using Differentiable Rendering [50.57718384229912]
We present a new approach for synthesizing novel views of people in new poses.
Our synthesis makes use of diffuse Gaussian primitives that represent the underlying skeletal structure of a human.
Rendering these primitives results in a high-dimensional latent image, which is then transformed into an RGB image by a decoder network.
arXiv Detail & Related papers (2023-03-28T10:48:33Z)
- HDHumans: A Hybrid Approach for High-fidelity Digital Humans [107.19426606778808]
HDHumans is the first method for HD human character synthesis that jointly produces an accurate and temporally coherent 3D deforming surface.
Our method is carefully designed to achieve a synergy between classical surface deformation and neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-10-21T14:42:11Z)
- Human Pose Manipulation and Novel View Synthesis using Differentiable Rendering [46.04980667824064]
We present a new approach for synthesizing novel views of people in new poses.
Our synthesis makes use of diffuse Gaussian primitives that represent the underlying skeletal structure of a human.
Rendering these primitives results in a high-dimensional latent image, which is then transformed into an RGB image by a decoder network.
arXiv Detail & Related papers (2021-11-24T19:00:07Z)
- Style and Pose Control for Image Synthesis of Humans from a Single Monocular View [78.6284090004218]
StylePoseGAN extends a non-controllable generator to accept conditioning of pose and appearance separately.
Our network can be trained in a fully supervised way with human images to disentangle pose, appearance and body parts.
StylePoseGAN achieves state-of-the-art image generation fidelity on common perceptual metrics.
arXiv Detail & Related papers (2021-02-22T18:50:47Z)
- Learning Compositional Radiance Fields of Dynamic Human Heads [13.272666180264485]
We propose a novel compositional 3D representation that combines the best of previous methods to produce both higher-resolution and faster results.
Differentiable volume rendering is employed to compute photo-realistic novel views of the human head and upper body.
Our approach achieves state-of-the-art results for synthesizing novel views of dynamic human heads and the upper body.
arXiv Detail & Related papers (2020-12-17T22:19:27Z)
- A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors [8.13692293541489]
Lip sync has emerged as a promising technique for generating mouth movements from audio signals.
This paper presents a novel lip-sync framework specially designed for producing high-fidelity virtual news anchors.
arXiv Detail & Related papers (2020-02-20T12:26:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.