GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking
Face Generation
- URL: http://arxiv.org/abs/2305.00787v1
- Date: Mon, 1 May 2023 12:24:09 GMT
- Title: GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking
Face Generation
- Authors: Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang,
Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao
- Abstract summary: A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
- Score: 71.73912454164834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating talking person portraits with arbitrary speech audio is a crucial
problem in the field of digital human and metaverse. A modern talking face
generation method is expected to achieve the goals of generalized audio-lip
synchronization, good video quality, and high system efficiency. Recently,
neural radiance field (NeRF) has become a popular rendering technique in this
field since it can achieve high-fidelity and 3D-consistent talking face
generation from a few-minute-long training video. However, several challenges
remain for NeRF-based methods: 1) for lip synchronization, it is hard to
generate long facial motion sequences with high temporal consistency and
audio-lip accuracy; 2) for video quality, the limited data used to train the
renderer makes it vulnerable to out-of-domain input conditions, so it
occasionally produces bad rendering results; 3) for system efficiency, the slow
training and inference speed of the vanilla NeRF severely obstructs its use in
real-world applications. In this paper, we
propose GeneFace++ to handle these challenges by 1) utilizing the pitch contour
as an auxiliary feature and introducing a temporal loss in the facial motion
prediction process; 2) proposing a landmark locally linear embedding method to
regulate the outliers in the predicted motion sequence to avoid robustness
issues; 3) designing a computationally efficient NeRF-based motion-to-video
renderer to achieve fast training and real-time inference. With these
settings, GeneFace++ becomes the first NeRF-based method that achieves stable
and real-time talking face generation with generalized audio-lip
synchronization. Extensive experiments show that our method outperforms
state-of-the-art baselines in both subjective and objective evaluations.
Video samples are available at https://genefaceplusplus.github.io .
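Of the three techniques above, the landmark locally linear embedding (LLE) step in 2) is concrete enough to sketch: each predicted landmark frame is projected onto the locally linear patch spanned by its nearest neighbours in the training-set landmark database, pulling out-of-domain predictions back toward poses the renderer has seen. Below is a minimal NumPy sketch of this idea; the function name, neighbour count, and regularizer are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def lle_project(pred, train_db, k=10, reg=1e-3):
    """Project one predicted landmark frame onto the locally linear
    patch spanned by its k nearest neighbours in the training set.

    pred     : (D,)   flattened landmark vector for one frame
    train_db : (N, D) landmark vectors extracted from the training video
    """
    # 1) find the k nearest training frames
    dists = np.linalg.norm(train_db - pred, axis=1)
    neighbors = train_db[np.argsort(dists)[:k]]      # (k, D)

    # 2) classic LLE weights: min ||pred - w @ neighbors||^2, sum(w) = 1
    G = (neighbors - pred) @ (neighbors - pred).T    # (k, k) local Gram matrix
    G += reg * (np.trace(G) + 1e-8) * np.eye(k)      # regularize for stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()

    # 3) the regulated landmark is the weighted neighbour combination
    return w @ neighbors
```

In practice one would likely blend the projected frame with the raw prediction (e.g. alpha * projected + (1 - alpha) * raw) so that regulation removes outliers without flattening expressive motion; whether GeneFace++ does this is not stated in the abstract.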
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face
Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z) - StableFace: Analyzing and Improving Motion Stability for Talking Face
Generation [38.25025849434312]
We study the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video.
We find that several issues can lead to jitter in the synthesized talking face video.
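Jitter of this kind is typically quantified with temporal differences of the landmark or vertex trajectories. As a rough illustration, here is a generic second-difference measure (an assumption, not necessarily the metric StableFace itself uses):

```python
import numpy as np

def jitter_score(motion):
    """Generic jitter measure: mean magnitude of the second-order
    temporal difference (discrete acceleration) of a motion sequence.

    motion : (T, D) per-frame flattened landmark/vertex vectors
    """
    accel = motion[2:] - 2 * motion[1:-1] + motion[:-2]   # (T-2, D)
    return float(np.linalg.norm(accel, axis=1).mean())
```

A temporal loss such as the one GeneFace++ introduces in its motion predictor plausibly penalizes terms of this kind during training, though the abstract does not give its exact form.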
arXiv Detail & Related papers (2022-08-29T16:56:35Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech
Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
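To make the autoregressive scheme concrete, here is a heavily simplified PyTorch sketch of a decoder that attends over precomputed audio features and predicts face-motion frames one step at a time. All dimensions, layer counts, and names are illustrative assumptions; FaceFormer's actual design additionally uses biased cross-modal attention and periodic positional encodings not shown here:

```python
import torch
import torch.nn as nn

class ARFacePredictor(nn.Module):
    """Toy autoregressive face-motion decoder (illustrative only)."""

    def __init__(self, audio_dim=768, motion_dim=5023 * 3, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, motion_dim)

    @torch.no_grad()
    def generate(self, audio_feats):                 # (1, T, audio_dim)
        memory = self.audio_proj(audio_feats)        # audio context to attend over
        steps = audio_feats.shape[1]
        frames = [torch.zeros(1, 1, self.head.out_features)]  # zero start frame
        for _ in range(steps):
            tgt = self.motion_proj(torch.cat(frames, dim=1))
            mask = nn.Transformer.generate_square_subsequent_mask(tgt.shape[1])
            out = self.decoder(tgt, memory, tgt_mask=mask)
            frames.append(self.head(out[:, -1:]))    # predict the next frame
        return torch.cat(frames[1:], dim=1)          # (1, T, motion_dim)
```

The per-frame output here would be flattened vertex offsets added to a neutral template mesh; the 5023-vertex dimension is borrowed from the FLAME topology commonly used in this line of work and is purely an assumption.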
arXiv Detail & Related papers (2021-12-10T04:21:59Z) - LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from
Video using Pose and Lighting Normalization [4.43316916502814]
We present a video-based learning framework for animating personalized 3D talking faces from audio.
We introduce two training-time data normalizations that significantly improve data sample efficiency.
Our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores.
arXiv Detail & Related papers (2021-06-08T08:56:40Z)