Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural
Zero-shot Speech Synthesis from a Face Image
- URL: http://arxiv.org/abs/2311.05844v1
- Date: Mon, 25 Sep 2023 13:46:00 GMT
- Title: Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural
Zero-shot Speech Synthesis from a Face Image
- Authors: Minki Kang, Wooseok Han, Eunho Yang
- Abstract summary: We propose Face-StyleSpeech, a zero-shot Text-To-Speech model that generates natural speech conditioned on a face image.
Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines.
- Score: 42.23406025068276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating a voice from a face image is crucial for developing virtual humans
capable of interacting using their unique voices, without relying on
pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a
zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech
conditioned on a face image rather than reference speech. We hypothesize that
learning both speaker identity and prosody from a face image poses a
significant challenge. To address the issue, our TTS model incorporates both a
face encoder and a prosody encoder. The prosody encoder is specifically
designed to model prosodic features that cannot be captured from a face image
alone, allowing the face encoder to focus solely on capturing the speaker
identity from the face image. Experimental results demonstrate that
Face-StyleSpeech effectively generates more natural speech from a face image
than baselines, even for face images on which the model has not been trained. Samples are
at our demo page https://face-stylespeech.github.io.
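To make the described architecture concrete, here is a minimal, hypothetical PyTorch-style sketch of a face encoder and a prosody encoder conditioning a toy acoustic decoder. All module names, dimensions, and the GRU-based decoder are illustrative assumptions for exposition only and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a (hypothetical, pretrained) face embedding to a speaker-identity vector."""
    def __init__(self, face_dim=512, style_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, 256), nn.ReLU(),
            nn.Linear(256, style_dim),
        )

    def forward(self, face_emb):            # (B, face_dim)
        return self.proj(face_emb)          # (B, style_dim)

class ProsodyEncoder(nn.Module):
    """Summarizes prosodic cues (e.g. from a reference mel-spectrogram during
    training) that a static face image alone cannot convey."""
    def __init__(self, mel_dim=80, prosody_dim=128):
        super().__init__()
        self.rnn = nn.GRU(mel_dim, prosody_dim, batch_first=True)

    def forward(self, mel):                 # (B, T, mel_dim)
        _, h = self.rnn(mel)
        return h[-1]                        # (B, prosody_dim)

class FaceConditionedTTS(nn.Module):
    """Toy acoustic model: phoneme embeddings conditioned on concatenated
    identity and prosody vectors, decoded to a mel-spectrogram."""
    def __init__(self, n_phonemes=100, mel_dim=80, style_dim=128, prosody_dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 256)
        self.decoder = nn.GRU(256 + style_dim + prosody_dim, 256, batch_first=True)
        self.mel_head = nn.Linear(256, mel_dim)

    def forward(self, phonemes, identity, prosody):
        x = self.phoneme_emb(phonemes)                      # (B, L, 256)
        cond = torch.cat([identity, prosody], dim=-1)       # (B, style+prosody)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time
        out, _ = self.decoder(torch.cat([x, cond], dim=-1))
        return self.mel_head(out)                           # (B, L, mel_dim)

# Toy usage: a batch of 2 utterances, 12 phonemes each.
face_emb = torch.randn(2, 512)       # stand-in for a pretrained face embedding
ref_mel = torch.randn(2, 200, 80)    # reference mel used only to supply prosody
phonemes = torch.randint(0, 100, (2, 12))

identity = FaceEncoder()(face_emb)
prosody = ProsodyEncoder()(ref_mel)
mel = FaceConditionedTTS()(phonemes, identity, prosody)
print(mel.shape)                     # torch.Size([2, 12, 80])
```

The point of the sketch is the division of labor the abstract hypothesizes: the prosody branch absorbs variation the face cannot explain, leaving the face branch to encode only speaker identity.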
Related papers
- FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles [29.185409608539747]
Prior vision-driven Text-to-Speech (TTS) studies have grounded their investigations in real-person faces.
We introduce a novel FaceSpeak approach, which extracts salient identity characteristics and emotional representations from a wide variety of image styles.
It mitigates extraneous information, resulting in synthesized speech closely aligned with a character's persona.
arXiv Detail & Related papers (2025-01-02T02:00:15Z) - GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Realistic Speech-to-Face Generation with Speech-Conditioned Latent
Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech [33.01930038988336]
We introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework, called Face-TTS.
We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments.
Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers.
arXiv Detail & Related papers (2023-02-27T11:59:28Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z) - Residual-guided Personalized Speech Synthesis based on Face Image [14.690030837311376]
Previous works derive personalized speech features by training a model on a large dataset of the target speaker's audio recordings.
In this work, we instead extract personalized speech features from human faces and synthesize personalized speech using a neural vocoder.
arXiv Detail & Related papers (2022-04-01T15:27:14Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary
Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)