Related papers: AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

URL: http://arxiv.org/abs/2312.10921v1
Date: Mon, 18 Dec 2023 04:14:38 GMT
Title: AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis
Authors: Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong and Tieniu Tan
Abstract summary: We present Audio Enhanced Neural Radiance Field (AE-NeRF) to generate realistic portraits of a new speaker with fewshot dataset. AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even in limited training set or training iterations.
Score: 42.203900183584665
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-driven talking head synthesis is a promising topic with wide applications in digital human, film making and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only few seconds of talking video is available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., mouth is audio related, while ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker with fewshot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio related and audio independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even in limited training set or training iterations.

Related papers

KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation [8.111156834055821]
Reconstructing a talking face using audio significantly contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Recently, researchers have proposed a new approach of constructing the entire face, including face pose, neck, and shoulders. We propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio.
arXiv Detail & Related papers (2024-09-09T05:20:02Z)
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis [14.437741528053504]
We design a Single-Shot Speech-Driven Radiance Field (S3D-NeRF) method to tackle the three difficulties: learning a representative appearance feature for each identity, modeling motion of different face regions with audio, and keeping the temporal consistency of the lip area. Our S3D-NeRF surpasses previous arts on both video fidelity and audio-lip synchronization.
arXiv Detail & Related papers (2024-08-18T03:59:57Z)
NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for Talking Face Synthesis [2.5387791616637587]
Talking face synthesis driven by audio is one of the current research hotspots in the fields of multidimensional signal processing and multimedia. NeRF has recently been brought to this research field in order to enhance the realism and 3D effect of the generated faces. This paper proposes a talking face synthesis method via NeRF with attention-based disentanglement (NeRF-AD)
arXiv Detail & Related papers (2024-01-23T08:54:10Z)
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method. A head-aware torso-NeRF is proposed to eliminate the head-torso problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture. We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation [61.8546794105462]
We propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. We first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. To enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions.
arXiv Detail & Related papers (2022-01-19T18:54:41Z)
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video. We use neural scene representation networks to bridge the gap between audio input and video output. Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.