Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior
- URL: http://arxiv.org/abs/2310.03363v1
- Date: Thu, 5 Oct 2023 07:44:49 GMT
- Title: Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior
- Authors: Jinting Wang, Li Liu, Jun Wang, Hei Victor Cheng
- Abstract summary: We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
- Score: 13.198105709331617
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech-to-face generation is an intriguing area of research that focuses on
generating realistic facial images based on a speaker's audio speech. However,
state-of-the-art methods employing GAN-based architectures lack stability and
cannot generate realistic face images. To fill this gap, we propose a novel
speech-to-face generation framework, which leverages a Speech-Conditioned
Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the
first work to harness the exceptional modeling capabilities of diffusion models
for speech-to-face generation. Preserving the shared identity information
between speech and face is crucial in generating realistic results. Therefore,
we employ contrastive pre-training for both the speech encoder and the face
encoder. This pre-training strategy facilitates effective alignment between the
attributes of speech, such as age and gender, and the corresponding facial
characteristics in the face images. Furthermore, we tackle the challenge posed
by excessive diversity in the synthesis process caused by the diffusion model.
To overcome this challenge, we introduce the concept of residuals by
integrating a statistical face prior into the diffusion process. This addition
helps to eliminate the shared component across the faces and enhances the
subtle variations captured by the speech condition. Extensive quantitative,
qualitative, and user study experiments demonstrate that our method can produce
more realistic face images while preserving the identity of the speaker better
than state-of-the-art methods. Notably, our method demonstrates significant
gains in all metrics on the AVSpeech and VoxCeleb datasets, with particularly
noteworthy improvements of 32.17 and 32.72 on the cosine distance metric for
the two datasets, respectively.
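The contrastive pre-training described above is, in spirit, a CLIP-style alignment between a speech encoder and a face encoder. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the encoder architectures, embedding dimension, and temperature are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of CLIP-style contrastive alignment between speech and
# face embeddings. Encoders, dimensions, and temperature are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFaceContrastive(nn.Module):
    def __init__(self, speech_encoder: nn.Module, face_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. mel-spectrogram -> (B, D)
        self.face_encoder = face_encoder      # e.g. face crop -> (B, D)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, speech, faces):
        # L2-normalize both modalities so dot products are cosine similarities.
        s = F.normalize(self.speech_encoder(speech), dim=-1)  # (B, D)
        f = F.normalize(self.face_encoder(faces), dim=-1)     # (B, D)
        logits = self.logit_scale.exp() * s @ f.t()            # (B, B)
        targets = torch.arange(s.size(0), device=s.device)
        # Symmetric InfoNCE: each utterance should match its own speaker's face
        # and vice versa, pulling shared attributes (age, gender, ...) together.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```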
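The "residual" idea can be read as having the latent diffusion model operate on the difference between a face latent and a precomputed statistical (mean) face prior, conditioned on the speech embedding. The following sketch is one possible interpretation of such a training step under standard DDPM assumptions; `unet`, `vae_encoder`, and `mean_face_latent` are hypothetical placeholders rather than the authors' implementation.

```python
# Hypothetical training step for a speech-conditioned latent diffusion model
# that denoises the residual between a face latent and a mean-face prior.
# Standard DDPM noise-prediction objective; the residual formulation is an
# interpretation of the abstract, not the paper's released code.
import torch
import torch.nn.functional as F

def residual_diffusion_step(unet, vae_encoder, speech_embed, faces,
                            mean_face_latent, alphas_cumprod):
    with torch.no_grad():
        z = vae_encoder(faces)                    # (B, C, H, W) face latents
    residual = z - mean_face_latent               # strip the shared face component

    B = residual.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,), device=residual.device)
    noise = torch.randn_like(residual)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a_bar.sqrt() * residual + (1.0 - a_bar).sqrt() * noise

    # The denoiser is conditioned on the contrastively aligned speech embedding,
    # so the recovered residual carries the speaker-specific variation.
    pred_noise = unet(noisy, t, cond=speech_embed)
    return F.mse_loss(pred_noise, noise)
```

At sampling time, the mean-face latent would presumably be added back to the denoised residual before decoding, so the prior supplies the shared facial structure while the speech condition drives the subtle, speaker-specific variation.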
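The reported cosine distance numbers refer to identity preservation measured in an embedding space. A common way to compute such a metric (not necessarily the paper's exact protocol) is to compare face-recognition embeddings of the generated face and the real speaker's face:

```python
# Illustrative identity check: cosine distance between face-recognition
# embeddings of a generated face and the real speaker's face. The recognizer
# is a placeholder; the paper's exact evaluation protocol may differ.
import torch.nn.functional as F

def identity_cosine_distance(face_recognizer, generated_face, real_face):
    g = F.normalize(face_recognizer(generated_face), dim=-1)
    r = F.normalize(face_recognizer(real_face), dim=-1)
    return 1.0 - (g * r).sum(dim=-1)  # lower distance = better identity match
```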
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- Expression-preserving face frontalization improves visually assisted speech processing [35.647888055229956]
The main contribution of this paper is a frontalization methodology that preserves non-rigid facial deformations.
We show that the method, when incorporated into deep learning pipelines, improves word recognition and speech intelligibility scores by a considerable margin.
arXiv Detail & Related papers (2022-04-06T13:22:24Z)
- Attention-based Residual Speech Portrait Model for Speech to Face Generation [14.299566923828719]
We propose a novel Attention-based Residual Speech Portrait Model (AR-SPM).
Our proposed model accelerates training convergence and outperforms the state of the art in the quality of the generated faces.
arXiv Detail & Related papers (2020-07-09T03:31:33Z)
- Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging [19.285149134711382]
Facial image generation based on vocal characteristics from speech is one of such important yet challenging tasks.
Existing solutions to the speech2face problem render limited image quality and fail to preserve facial similarity.
We propose Speech Fusion to Face (SF2F) to address the issues of facial image quality and the poor connection between the vocal feature domain and modern image generation models.
arXiv Detail & Related papers (2020-06-10T15:19:31Z)
- From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech [20.41722156886205]
We propose a multi-modal learning framework that links the inference stage and generation stage.
The proposed method exploits recent developments in GAN techniques and generates the human face directly from the speech waveform.
Experimental results show that the proposed network can not only match the relationship between the human face and speech, but also generate high-quality human face samples conditioned on speech.
arXiv Detail & Related papers (2020-04-13T09:01:49Z)
- Dual-Attention GAN for Large-Pose Face Frontalization [59.689836951934694]
We present a novel Dual-Attention Generative Adversarial Network (DA-GAN) for photo-realistic face frontalization.
Specifically, a self-attention-based generator is introduced to integrate local features with their long-range dependencies.
A novel face-attention-based discriminator is applied to emphasize local features of face regions.
arXiv Detail & Related papers (2020-02-17T20:00:56Z)
- Joint Deep Learning of Facial Expression Synthesis and Recognition [97.19528464266824]
We propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER.
The proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions.
In order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm.
arXiv Detail & Related papers (2020-02-06T10:56:00Z)