Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior
- URL: http://arxiv.org/abs/2310.03363v1
- Date: Thu, 5 Oct 2023 07:44:49 GMT
- Title: Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior
- Authors: Jinting Wang, Li Liu, Jun Wang, Hei Victor Cheng
- Abstract summary: We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
- Score: 13.198105709331617
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech-to-face generation is an intriguing area of research that focuses on
generating realistic facial images based on a speaker's audio speech. However,
state-of-the-art methods employing GAN-based architectures lack stability and
cannot generate realistic face images. To fill this gap, we propose a novel
speech-to-face generation framework, which leverages a Speech-Conditioned
Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the
first work to harness the exceptional modeling capabilities of diffusion models
for speech-to-face generation. Preserving the shared identity information
between speech and face is crucial in generating realistic results. Therefore,
we employ contrastive pre-training for both the speech encoder and the face
encoder. This pre-training strategy facilitates effective alignment between the
attributes of speech, such as age and gender, and the corresponding facial
characteristics in the face images. Furthermore, we tackle the challenge posed
by excessive diversity in the synthesis process caused by the diffusion model.
To overcome this challenge, we introduce the concept of residuals by
integrating a statistical face prior into the diffusion process. This addition
helps to eliminate the shared component across the faces and enhances the
subtle variations captured by the speech condition. Extensive quantitative,
qualitative, and user study experiments demonstrate that our method can produce
more realistic face images while preserving the identity of the speaker better
than state-of-the-art methods. Notably, our method demonstrates significant
gains in all metrics on the AVSpeech and VoxCeleb datasets, with particularly
noteworthy improvements of 32.17 and 32.72 on the cosine distance metric for
the two datasets, respectively.
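The contrastive pre-training described above is, in spirit, a CLIP-style alignment between a speech encoder and a face encoder. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the encoder architectures, embedding dimension, and temperature are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of CLIP-style contrastive alignment between speech and
# face embeddings. Encoders, dimensions, and temperature are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFaceContrastive(nn.Module):
    def __init__(self, speech_encoder: nn.Module, face_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. mel-spectrogram -> (B, D)
        self.face_encoder = face_encoder      # e.g. face crop -> (B, D)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, speech, faces):
        # L2-normalize both modalities so dot products are cosine similarities.
        s = F.normalize(self.speech_encoder(speech), dim=-1)  # (B, D)
        f = F.normalize(self.face_encoder(faces), dim=-1)     # (B, D)
        logits = self.logit_scale.exp() * s @ f.t()            # (B, B)
        targets = torch.arange(s.size(0), device=s.device)
        # Symmetric InfoNCE: each utterance should match its own speaker's face
        # and vice versa, pulling shared attributes (age, gender, ...) together.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```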
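The "residual" idea can be read as having the latent diffusion model operate on the difference between a face latent and a precomputed statistical (mean) face prior, conditioned on the speech embedding. The following sketch is one possible interpretation of such a training step under standard DDPM assumptions; `unet`, `vae_encoder`, and `mean_face_latent` are hypothetical placeholders rather than the authors' implementation.

```python
# Hypothetical training step for a speech-conditioned latent diffusion model
# that denoises the residual between a face latent and a mean-face prior.
# Standard DDPM noise-prediction objective; the residual formulation is an
# interpretation of the abstract, not the paper's released code.
import torch
import torch.nn.functional as F

def residual_diffusion_step(unet, vae_encoder, speech_embed, faces,
                            mean_face_latent, alphas_cumprod):
    with torch.no_grad():
        z = vae_encoder(faces)                    # (B, C, H, W) face latents
    residual = z - mean_face_latent               # strip the shared face component

    B = residual.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,), device=residual.device)
    noise = torch.randn_like(residual)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a_bar.sqrt() * residual + (1.0 - a_bar).sqrt() * noise

    # The denoiser is conditioned on the contrastively aligned speech embedding,
    # so the recovered residual carries the speaker-specific variation.
    pred_noise = unet(noisy, t, cond=speech_embed)
    return F.mse_loss(pred_noise, noise)
```

At sampling time, the mean-face latent would presumably be added back to the denoised residual before decoding, so the prior supplies the shared facial structure while the speech condition drives the subtle, speaker-specific variation.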
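The reported cosine distance numbers refer to identity preservation measured in an embedding space. A common way to compute such a metric (not necessarily the paper's exact protocol) is to compare face-recognition embeddings of the generated face and the real speaker's face:

```python
# Illustrative identity check: cosine distance between face-recognition
# embeddings of a generated face and the real speaker's face. The recognizer
# is a placeholder; the paper's exact evaluation protocol may differ.
import torch.nn.functional as F

def identity_cosine_distance(face_recognizer, generated_face, real_face):
    g = F.normalize(face_recognizer(generated_face), dim=-1)
    r = F.normalize(face_recognizer(real_face), dim=-1)
    return 1.0 - (g * r).sum(dim=-1)  # lower distance = better identity match
```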
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- Expression-preserving face frontalization improves visually assisted speech processing [35.647888055229956]
The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations.
We show that the method, when incorporated into deep learning pipelines, improves word recognition and speech intelligibility scores by a considerable margin.
arXiv Detail & Related papers (2022-04-06T13:22:24Z)
- Attention-based Residual Speech Portrait Model for Speech to Face Generation [14.299566923828719]
We propose a novel Attention-based Residual Speech Portrait Model (AR-SPM).
Our proposed model accelerates training convergence and outperforms the state of the art in the quality of the generated faces.
arXiv Detail & Related papers (2020-07-09T03:31:33Z)
- Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging [19.285149134711382]
Facial image generation based on vocal characteristics from speech is one of such important yet challenging tasks.
Existing solutions to the speech2face problem render limited image quality and fail to preserve facial similarity.
We propose Speech Fusion to Face (SF2F) to address the issues of facial image quality and the poor connection between the vocal feature domain and modern image generation models.
arXiv Detail & Related papers (2020-06-10T15:19:31Z)
- From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech [20.41722156886205]
We propose a multi-modal learning framework that links the inference stage and generation stage.
The proposed method exploits recent developments in GAN techniques and generates the human face directly from the speech waveform.
Experimental results show that the proposed network can not only match the relationship between the human face and speech, but also generate high-quality human face samples conditioned on speech.
arXiv Detail & Related papers (2020-04-13T09:01:49Z)
- Dual-Attention GAN for Large-Pose Face Frontalization [59.689836951934694]
We present a novel Dual-Attention Generative Adversarial Network (DA-GAN) for photo-realistic face frontalization.
Specifically, a self-attention-based generator is introduced to integrate local features with their long-range dependencies.
A novel face-attention-based discriminator is applied to emphasize local features of face regions.
arXiv Detail & Related papers (2020-02-17T20:00:56Z)
- Joint Deep Learning of Facial Expression Synthesis and Recognition [97.19528464266824]
We propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER.
The proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions.
In order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm.
arXiv Detail & Related papers (2020-02-06T10:56:00Z)