Attention-based Residual Speech Portrait Model for Speech to Face Generation
- URL: http://arxiv.org/abs/2007.04536v1
- Date: Thu, 9 Jul 2020 03:31:33 GMT
- Title: Attention-based Residual Speech Portrait Model for Speech to Face Generation
- Authors: Jianrong Wang, Xiaosheng Hu, Li Liu, Wei Liu, Mei Yu, Tianyi Xu
- Abstract summary: We propose a novel Attention-based Residual Speech Portrait Model (AR-SPM).
Our proposed model accelerates training convergence and outperforms the state of the art in the quality of the generated face.
- Score: 14.299566923828719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a speaker's speech, it is interesting to see whether it is
possible to generate this speaker's face. One main challenge in this task is
to alleviate the natural mismatch between face and speech. To this end, in
this paper, we propose a novel Attention-based Residual Speech Portrait Model
(AR-SPM) by introducing the idea of the residual into a hybrid
encoder-decoder architecture, where face prior features are merged with the
output of the speech encoder to form the final face feature. In particular,
we establish a tri-item loss function, a weighted linear combination of the
L2-norm, L1-norm and negative cosine loss, to train the model by comparing
the final face feature with the true face feature. Evaluation on the
AVSpeech dataset shows that our proposed model accelerates the convergence
of training, outperforms the state of the art in the quality of the
generated face, and achieves superior gender and age recognition accuracy
compared with the ground truth.
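The core of the abstract, the residual merge of face prior features with the
speech encoder output and the tri-item training loss, can be sketched in a
few lines. This is a minimal illustration assuming PyTorch; the additive form
of the merge and the loss weights alpha, beta and gamma are assumptions for
exposition, not values taken from the paper.

```python
# Minimal sketch of the AR-SPM tri-item loss and residual merge (PyTorch).
# The loss weights and the additive merge are illustrative assumptions.
import torch
import torch.nn.functional as F

def tri_item_loss(pred_feat: torch.Tensor, true_feat: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0,
                  gamma: float = 1.0) -> torch.Tensor:
    """Weighted linear combination of L2-norm, L1-norm and negative cosine loss."""
    l2 = F.mse_loss(pred_feat, true_feat)
    l1 = F.l1_loss(pred_feat, true_feat)
    neg_cos = -F.cosine_similarity(pred_feat, true_feat, dim=-1).mean()
    return alpha * l2 + beta * l1 + gamma * neg_cos

def final_face_feature(speech_encoding: torch.Tensor,
                       face_prior: torch.Tensor) -> torch.Tensor:
    # Residual merge: face prior features combined with the speech encoder
    # output to form the final face feature (element-wise addition assumed).
    return speech_encoding + face_prior
```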
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z)
- Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image [42.23406025068276]
We propose Face-StyleSpeech, a zero-shot Text-To-Speech model that generates natural speech conditioned on a face image.
Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines.
arXiv Detail & Related papers (2023-09-25T13:46:00Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field, since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
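As a rough illustration of the temporal loss mentioned above, a common
formulation penalizes frame-to-frame differences in the predicted facial
motion; whether GeneFace++ uses exactly this form is an assumption here.

```python
# Hedged sketch of a temporal smoothness loss on predicted facial motion
# (PyTorch). GeneFace++ may use a different formulation.
import torch

def temporal_loss(motion: torch.Tensor) -> torch.Tensor:
    # motion: (T, D) sequence of predicted facial motion coefficients.
    # Penalize squared differences between consecutive frames to suppress jitter.
    return (motion[1:] - motion[:-1]).pow(2).mean()
```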
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Residual-guided Personalized Speech Synthesis based on Face Image [14.690030837311376]
Previous works derive personalized speech features by training the model on a large dataset of the target speaker's audio recordings.
In this work, we instead extract personalized speech features from human faces and synthesize personalized speech with a neural vocoder.
arXiv Detail & Related papers (2022-04-01T15:27:14Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence for each target speaker is predicted, each target speech signal can be re-synthesized by feeding its symbols to the synthesis model.
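The predict-then-resynthesize flow described above can be sketched as follows;
the two function arguments are hypothetical stand-ins for the paper's
recognition and synthesis models, not its actual interfaces.

```python
# Toy sketch of the discretize-and-resynthesize pipeline (PyTorch tensors).
# predict_symbols and synthesize are hypothetical placeholders.
from typing import Callable, List
import torch

def separate_by_resynthesis(
    mixture: torch.Tensor,          # mixed waveform, shape (samples,)
    num_speakers: int,
    predict_symbols: Callable[[torch.Tensor, int], torch.Tensor],
    synthesize: Callable[[torch.Tensor], torch.Tensor],
) -> List[torch.Tensor]:
    outputs = []
    for spk in range(num_speakers):
        symbols = predict_symbols(mixture, spk)  # discrete symbol ids, shape (T,)
        outputs.append(synthesize(symbols))      # re-synthesized target waveform
    return outputs
```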
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Joint Face Image Restoration and Frontalization for Recognition [79.78729632975744]
In real-world scenarios, many factors may harm face recognition performance, e.g., large pose, bad illumination, low resolution, blur and noise.
Previous efforts usually first restore the low-quality faces to high-quality ones and then perform face recognition.
We propose a Multi-Degradation Face Restoration model to restore frontalized high-quality faces from the given low-quality ones.
arXiv Detail & Related papers (2021-05-12T03:52:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.