Residual-guided Personalized Speech Synthesis based on Face Image
- URL: http://arxiv.org/abs/2204.01672v1
- Date: Fri, 1 Apr 2022 15:27:14 GMT
- Title: Residual-guided Personalized Speech Synthesis based on Face Image
- Authors: Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, Li
Liu
- Abstract summary: Previous works derive personalized speech features by training the model on a large dataset composed of his/her audio sounds.
In this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using neural vocoder.
- Score: 14.690030837311376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous works derive personalized speech features by training the model on a
large dataset composed of his/her audio sounds. It was reported that face
information has a strong link with the speech sound. Thus in this work, we
innovatively extract personalized speech features from human faces to
synthesize personalized speech using neural vocoder. A Face-based Residual
Personalized Speech Synthesis Model (FR-PSS) containing a speech encoder, a
speech synthesizer and a face encoder is designed for PSS. In this model, by
designing two speech priors, a residual-guided strategy is introduced to guide
the face feature to approach the true speech feature in the training. Moreover,
considering the error of feature's absolute values and their directional bias,
we formulate a novel tri-item loss function for face encoder. Experimental
results show that the speech synthesized by our model is comparable to the
personalized speech synthesized by training a large amount of audio data in
previous works.
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural
Zero-shot Speech Synthesis from a Face Image [42.23406025068276]
We propose Face-StyleSpeech, a zero-shot Text-To-Speech model that generates natural speech conditioned on a face image.
Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines.
arXiv Detail & Related papers (2023-09-25T13:46:00Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - Zero-shot personalized lip-to-speech synthesis with face image based
voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z) - VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via
Speech-Visage Feature Selection [32.65865343643458]
Recent studies have shown impressive performance on synthesizing speech from silent talking face videos.
We introduce speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video.
Proposed framework brings the advantage of synthesizing the speech containing the right content even when the silent talking face video of an unseen subject is given.
arXiv Detail & Related papers (2022-06-15T11:29:58Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary
Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.