Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural
Zero-shot Speech Synthesis from a Face Image
- URL: http://arxiv.org/abs/2311.05844v1
- Date: Mon, 25 Sep 2023 13:46:00 GMT
- Title: Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural
Zero-shot Speech Synthesis from a Face Image
- Authors: Minki Kang, Wooseok Han, Eunho Yang
- Abstract summary: We propose Face-StyleSpeech, a zero-shot Text-To-Speech model that generates natural speech conditioned on a face image.
Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines.
- Score: 42.23406025068276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating a voice from a face image is crucial for developing virtual humans
capable of interacting using their unique voices, without relying on
pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a
zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech
conditioned on a face image rather than reference speech. We hypothesize that
learning both speaker identity and prosody from a face image poses a
significant challenge. To address the issue, our TTS model incorporates both a
face encoder and a prosody encoder. The prosody encoder is specifically
designed to model prosodic features that cannot be captured from a face image
alone, allowing the face encoder to focus solely on capturing the speaker
identity from the face image. Experimental results demonstrate that
Face-StyleSpeech effectively generates more natural speech from a face image
than baselines, even for face images on which the model has not been trained. Samples are
at our demo page https://face-stylespeech.github.io.
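To make the described architecture concrete, here is a minimal, hypothetical PyTorch-style sketch of a face encoder and a prosody encoder conditioning a toy acoustic decoder. All module names, dimensions, and the GRU-based decoder are illustrative assumptions for exposition only and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a (hypothetical, pretrained) face embedding to a speaker-identity vector."""
    def __init__(self, face_dim=512, style_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, 256), nn.ReLU(),
            nn.Linear(256, style_dim),
        )

    def forward(self, face_emb):            # (B, face_dim)
        return self.proj(face_emb)          # (B, style_dim)

class ProsodyEncoder(nn.Module):
    """Summarizes prosodic cues (e.g. from a reference mel-spectrogram during
    training) that a static face image alone cannot convey."""
    def __init__(self, mel_dim=80, prosody_dim=128):
        super().__init__()
        self.rnn = nn.GRU(mel_dim, prosody_dim, batch_first=True)

    def forward(self, mel):                 # (B, T, mel_dim)
        _, h = self.rnn(mel)
        return h[-1]                        # (B, prosody_dim)

class FaceConditionedTTS(nn.Module):
    """Toy acoustic model: phoneme embeddings conditioned on concatenated
    identity and prosody vectors, decoded to a mel-spectrogram."""
    def __init__(self, n_phonemes=100, mel_dim=80, style_dim=128, prosody_dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 256)
        self.decoder = nn.GRU(256 + style_dim + prosody_dim, 256, batch_first=True)
        self.mel_head = nn.Linear(256, mel_dim)

    def forward(self, phonemes, identity, prosody):
        x = self.phoneme_emb(phonemes)                      # (B, L, 256)
        cond = torch.cat([identity, prosody], dim=-1)       # (B, style+prosody)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time
        out, _ = self.decoder(torch.cat([x, cond], dim=-1))
        return self.mel_head(out)                           # (B, L, mel_dim)

# Toy usage: a batch of 2 utterances, 12 phonemes each.
face_emb = torch.randn(2, 512)       # stand-in for a pretrained face embedding
ref_mel = torch.randn(2, 200, 80)    # reference mel used only to supply prosody
phonemes = torch.randint(0, 100, (2, 12))

identity = FaceEncoder()(face_emb)
prosody = ProsodyEncoder()(ref_mel)
mel = FaceConditionedTTS()(phonemes, identity, prosody)
print(mel.shape)                     # torch.Size([2, 12, 80])
```

The point of the sketch is the division of labor the abstract hypothesizes: the prosody branch absorbs variation the face cannot explain, leaving the face branch to encode only speaker identity.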
Related papers
- FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles [29.185409608539747]
Prior vision-driven Text-to-Speech (TTS) studies have grounded their investigations in real-person faces.
We introduce a novel FaceSpeak approach, which extracts salient identity characteristics and emotional representations from a wide variety of image styles.
It mitigates extraneous information, resulting in synthesized speech closely aligned with a character's persona.
arXiv Detail & Related papers (2025-01-02T02:00:15Z) - GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Realistic Speech-to-Face Generation with Speech-Conditioned Latent
Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech [33.01930038988336]
We introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework, called Face-TTS.
We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments.
Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers.
arXiv Detail & Related papers (2023-02-27T11:59:28Z) - Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z) - Residual-guided Personalized Speech Synthesis based on Face Image [14.690030837311376]
Previous works derive personalized speech features by training a model on a large dataset of the target speaker's audio recordings.
In this work, we instead extract personalized speech features from human faces and synthesize personalized speech using a neural vocoder.
arXiv Detail & Related papers (2022-04-01T15:27:14Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary
Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)