Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
- URL: http://arxiv.org/abs/2201.07786v1
- Date: Wed, 19 Jan 2022 18:54:41 GMT
- Title: Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
- Authors: Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, Bolei Zhou
- Abstract summary: We propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF.
We first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering.
To enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions.
- Score: 61.8546794105462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Animating a high-fidelity video portrait with speech audio is crucial for
virtual reality and digital entertainment. While most previous studies rely on
accurate explicit structural information, recent works explore the implicit
scene representation of Neural Radiance Fields (NeRF) for realistic generation.
To capture the inconsistent motions as well as the semantic difference
between the human head and torso, some works model them via two separate sets of
NeRF, leading to unnatural results. In this work, we propose Semantic-aware
Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven
portraits using one unified set of NeRF. The proposed model can handle the
detailed local facial semantics and the global head-torso relationship through
two semantic-aware modules. Specifically, we first propose a Semantic-Aware
Dynamic Ray Sampling module with an additional parsing branch that facilitates
audio-driven volume rendering. Moreover, to enable portrait rendering in one
unified neural radiance field, a Torso Deformation module is designed to
stabilize the large-scale non-rigid torso motions. Extensive evaluations
demonstrate that our proposed approach renders more realistic video portraits
compared to previous methods. Project page:
https://alvinliu0.github.io/projects/SSP-NeRF
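The abstract names two modules, a Semantic-Aware Dynamic Ray Sampling module and a Torso Deformation module, but gives no implementation detail. The sketch below is one plausible reading of those two ideas, not the authors' code: a ray sampler that spends more of the per-batch ray budget on small, highly dynamic semantic regions (e.g. the lips) identified by the parsing branch, and a tiny MLP that warps torso points by a predicted non-rigid offset before they are queried against the single shared radiance field. All class ids, weights, shapes, and function names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-class ray budget: small but highly dynamic regions
# (lips, eyes) receive far more rays than the large static background.
CLASS_WEIGHTS = {0: 0.05, 1: 0.15, 2: 0.40, 3: 0.40}  # bg, torso, eyes, lips

def sample_rays(parsing_map: np.ndarray, n_rays: int, rng=None) -> np.ndarray:
    """Draw pixel indices for volume rendering, biased by semantic class.

    parsing_map: (H, W) integer class ids from a face-parsing branch.
    Returns flat pixel indices of shape (n_rays,).
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = parsing_map.reshape(-1)
    probs = np.zeros(flat.shape[0], dtype=np.float64)
    for cls, weight in CLASS_WEIGHTS.items():
        mask = flat == cls
        count = int(mask.sum())
        if count > 0:
            # Each class gets its weight's share of the ray budget, split
            # evenly over its pixels, so small regions are oversampled.
            probs[mask] = weight / count
    probs /= probs.sum()
    return rng.choice(flat.shape[0], size=n_rays, replace=False, p=probs)

def deform_torso(points: np.ndarray, cond: np.ndarray,
                 W1: np.ndarray, b1: np.ndarray,
                 W2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Warp (N, 3) torso sample points by a predicted non-rigid offset.

    cond is a per-frame conditioning vector (e.g. an audio or time
    embedding); W1/b1/W2/b2 form a toy two-layer offset MLP standing in
    for whatever deformation network the paper actually trains.
    """
    c = np.broadcast_to(cond, (points.shape[0], cond.shape[-1]))
    h = np.maximum(np.concatenate([points, c], axis=-1) @ W1 + b1, 0.0)  # ReLU
    return points + (h @ W2 + b2)  # canonical-space points for the shared NeRF
```

With a 512x512 parsing map, for instance, sample_rays(parsing_map, 2048) returns 2048 pixel indices concentrated around the mouth and eyes, where audio-driven photometric error matters most; the actual module presumably learns or adapts these weights rather than fixing them as above.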
Related papers
- AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis [42.203900183584665]
We present Audio Enhanced Neural Radiance Field (AE-NeRF) to generate realistic portraits of a new speaker from a few-shot dataset.
AE-NeRF surpasses the state of the art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or few training iterations.
arXiv Detail & Related papers (2023-12-18T04:14:38Z)
- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z)
- Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [61.6677901687009]
We propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits.
Our method can generate realistic, audio-lip-synchronized talking portrait videos.
arXiv Detail & Related papers (2022-11-22T16:03:11Z)
- Reconstructing Personalized Semantic Facial NeRF Models From Monocular Video [27.36067360218281]
We present a novel semantic model for the human head defined with a neural radiance field.
The 3D-consistent head model consists of a set of disentangled and interpretable bases, and can be driven by low-dimensional expression coefficients.
With a short monocular RGB video as input, our method can construct the subject's semantic facial NeRF model in only ten to twenty minutes.
arXiv Detail & Related papers (2022-10-12T11:56:52Z)
- Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis [90.43371339871105]
We propose Dynamic Facial Radiance Fields (DFRF) for few-shot talking head synthesis.
DFRF conditions the face radiance field on 2D appearance images to learn a face prior.
Experiments show DFRF can synthesize natural, high-quality audio-driven talking head videos for novel identities with only 40k training iterations.
arXiv Detail & Related papers (2022-07-24T16:46:03Z)
- PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering [56.762094966235566]
A Portrait Image Neural Renderer is proposed to control face motions with the parameters of three-dimensional morphable face models.
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
Our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream.
arXiv Detail & Related papers (2021-09-17T07:24:16Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.