GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance
- URL: http://arxiv.org/abs/2312.07385v1
- Date: Tue, 12 Dec 2023 16:00:55 GMT
- Title: GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance
- Authors: Haiming Zhang, Zhihao Yuan, Chaoda Zheng, Xu Yan, Baoyuan Wang,
Guanbin Li, Song Wu, Shuguang Cui, Zhen Li
- Abstract summary: GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
- Score: 83.43852715997596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although existing speech-driven talking face generation methods achieve
significant progress, they are far from real-world application due to the
avatar-specific training demand and unstable lip movements. To address the
above issues, we propose GSmoothFace, a novel two-stage generalized talking
face generation model guided by a fine-grained 3D face model, which can
synthesize smooth lip dynamics while preserving the speaker's identity. Our
proposed GSmoothFace model mainly consists of the Audio to Expression
Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT)
module. Specifically, we first develop the A2EP module to predict expression
parameters synchronized with the driven speech. It uses a transformer to
capture the long-term audio context and learns the parameters from the
fine-grained 3D facial vertices, resulting in accurate and smooth
lip-synchronization performance. Afterward, the well-designed TAFT module,
empowered by Morphology Augmented Face Blending (MAFB), takes the predicted
expression parameters and target video as inputs to modify the facial region of
the target video without distorting the background content. The TAFT
effectively exploits the identity appearance and background context in the
target video, which makes it possible to generalize to different speakers
without retraining. Both quantitative and qualitative experiments confirm the
superiority of our method in terms of realism, lip synchronization, and visual
quality. See the project page for code, data, and to request pre-trained models:
https://zhanghm1995.github.io/GSmoothFace.
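The abstract describes the two-stage design only at a high level. The sketch below is an illustrative outline of how such a pipeline could be wired in PyTorch: the class names A2EP and TAFT follow the abstract, but every layer choice, tensor shape, and the masked blend standing in for MAFB are assumptions made for illustration, not the authors' released implementation (which is linked from the project page).

```python
# Illustrative sketch only: layer choices, shapes, and the masked blend standing
# in for MAFB are assumptions drawn from the abstract, not the released code.
import torch
import torch.nn as nn


class A2EP(nn.Module):
    """Audio to Expression Prediction: Transformer over the audio sequence,
    emitting per-frame expression parameters."""

    def __init__(self, audio_dim=80, d_model=256, n_exp=64):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # long-term audio context
        self.head = nn.Linear(d_model, n_exp)

    def forward(self, audio_feats):                # (B, T, audio_dim)
        h = self.encoder(self.proj(audio_feats))
        return self.head(h)                        # (B, T, n_exp) expression parameters


class TAFT(nn.Module):
    """Target Adaptive Face Translation: fuse the predicted expression with a
    target frame, editing only the facial region so the background is untouched."""

    def __init__(self, n_exp=64):
        super().__init__()
        self.render = nn.Conv2d(3 + n_exp, 3, kernel_size=3, padding=1)

    def forward(self, target_frame, exp_params, face_mask):
        b, _, h, w = target_frame.shape
        exp_map = exp_params.view(b, -1, 1, 1).expand(-1, -1, h, w)
        generated = torch.tanh(self.render(torch.cat([target_frame, exp_map], dim=1)))
        # Masked blend: modify the face, keep the background from the target video.
        return face_mask * generated + (1 - face_mask) * target_frame


# Usage on dummy data: predict expressions for 100 audio frames, edit one frame.
a2ep, taft = A2EP(), TAFT()
audio = torch.randn(1, 100, 80)                    # e.g. mel-spectrogram features
exps = a2ep(audio)                                 # (1, 100, 64)
frame = torch.rand(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:192, 64:192] = 1.0                    # rough facial-region mask
edited = taft(frame, exps[:, 0], mask)             # (1, 3, 256, 256)
```

The key property the blend illustrates is the one the abstract claims for TAFT: only the masked facial region of the target frame is rewritten, so identity appearance and background context carry over to new speakers without retraining.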
Related papers
- MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos.
MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model to improve the efficiency and robustness of personalized TFG.
Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z) - RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio.
Our experiments show that our method surpasses state-of-the-art methods on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z) - PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo
Multi-modal Features [22.31865247379668]
Speech-driven 3D facial animation has improved considerably in recent years.
Most related works only utilize acoustic modality and neglect the influence of visual and textual cues.
We present a novel framework, PMMTalk, which uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation.
arXiv Detail & Related papers (2023-12-05T14:12:38Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking
Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process (a minimal temporal-loss example appears in the sketch after this list).
arXiv Detail & Related papers (2023-05-01T12:24:09Z) - FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes (a generic sketch of this decoding pattern appears after this list).
arXiv Detail & Related papers (2021-12-10T04:21:59Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
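Two patterns recur in the entries above: FaceFormer (and the A2EP stage of GSmoothFace) predicts facial motion from audio with an autoregressive Transformer, and GeneFace++ regularizes the predicted motion with a temporal loss. The sketch below illustrates those generic patterns only; the model sizes, the greedy decoding loop, and the squared frame-difference loss are illustrative assumptions, not any paper's actual code.

```python
# Illustrative only: a generic audio-conditioned autoregressive motion decoder
# plus a simple temporal smoothness loss; not FaceFormer's or GeneFace++'s code.
import torch
import torch.nn as nn


class AudioToMotionDecoder(nn.Module):
    def __init__(self, audio_dim=80, motion_dim=64, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, motion_dim)

    @torch.no_grad()
    def generate(self, audio_feats, start_motion):
        """Predict one motion frame at a time, feeding each prediction back in."""
        memory = self.audio_proj(audio_feats)          # audio context for cross-attention
        motions = [start_motion]                       # (B, 1, motion_dim) seed frame
        for _ in range(audio_feats.size(1) - 1):
            tgt = self.motion_proj(torch.cat(motions, dim=1))
            L = tgt.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            h = self.decoder(tgt, memory, tgt_mask=causal)
            motions.append(self.out(h[:, -1:]))        # next-frame prediction
        return torch.cat(motions, dim=1)               # (B, T, motion_dim)


def temporal_loss(motion):
    """Penalize frame-to-frame jumps, encouraging smooth predicted motion."""
    return (motion[:, 1:] - motion[:, :-1]).pow(2).mean()


model = AudioToMotionDecoder()
audio = torch.randn(1, 50, 80)                         # 50 frames of audio features
motion = model.generate(audio, torch.zeros(1, 1, 64))  # (1, 50, 64)
print(motion.shape, temporal_loss(motion).item())
```

The causal mask restricts each predicted frame to depend only on earlier frames, while cross-attention over the projected audio sequence provides the long-term context these papers emphasize; the temporal loss is the simplest form of the smoothness constraint GeneFace++ describes.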
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.