Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent
Representations
- URL: http://arxiv.org/abs/2107.12003v1
- Date: Mon, 26 Jul 2021 07:36:02 GMT
- Title: Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent
Representations
- Authors: Se-Yun Um, Jihyun Kim, Jihyun Lee, Sangshin Oh, Kyungguen Byun, and
Hong-Goo Kang
- Abstract summary: We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
- Score: 22.14238843571225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an effective method to synthesize speaker-specific
speech waveforms by conditioning on videos of an individual's face. Using a
generative adversarial network (GAN) with linguistic and speaker characteristic
features as auxiliary conditions, our method directly converts face images into
speech waveforms under an end-to-end training framework. The linguistic
features are extracted from lip movements using a lip-reading model, and the
speaker characteristic features are predicted from face images using
cross-modal learning with a pre-trained acoustic model. Since these two
features are uncorrelated and controlled independently, we can flexibly
synthesize speech waveforms whose speaker characteristics vary depending on the
input face images. Therefore, our method can be regarded as a multi-speaker
face-to-speech waveform model. We show the superiority of our proposed model
over conventional methods in terms of both objective and subjective evaluation
results. Specifically, we evaluate the performance of the linguistic feature
and the speaker characteristic generation modules by measuring the accuracy of
automatic speech recognition and automatic speaker/gender recognition tasks,
respectively. We also evaluate the naturalness of the synthesized speech
waveforms using a mean opinion score (MOS) test.
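As a rough illustration of the conditioning scheme described in the abstract, the sketch below wires together a lip-reading encoder (linguistic features), a face-based speaker encoder (speaker characteristic features), and a waveform generator in PyTorch. All module names, layer choices, and dimensions are assumptions made for illustration, not the authors' implementation; the actual Facetron model is a GAN trained end to end, and the adversarial losses are omitted here.

```python
import torch
import torch.nn as nn

# Minimal sketch of the conditioning pipeline described in the abstract.
# Module names, dimensions, and layer choices are illustrative assumptions.

class LipReader(nn.Module):
    """Stands in for the lip-reading model that extracts linguistic features
    from a sequence of mouth-region frames of shape (B, T, C, H, W)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.temporal = nn.GRU(32, feat_dim, batch_first=True)

    def forward(self, frames):
        b, t = frames.shape[:2]
        x = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.temporal(x)          # (B, T, feat_dim) linguistic features
        return feats


class SpeakerEncoder(nn.Module):
    """Predicts a speaker-characteristic embedding from a single face image,
    standing in for the cross-modally trained face encoder."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim),
        )

    def forward(self, face):
        return self.net(face)                # (B, emb_dim) speaker embedding


class WaveformGenerator(nn.Module):
    """Generator conditioned on both feature streams; upsamples the
    linguistic feature sequence to a waveform (hop size is an assumption)."""
    def __init__(self, feat_dim=256, emb_dim=128, hop=200):
        super().__init__()
        self.proj = nn.Linear(feat_dim + emb_dim, 128)
        self.upsample = nn.ConvTranspose1d(128, 1, kernel_size=hop, stride=hop)

    def forward(self, ling, spk):
        # Broadcast the utterance-level speaker embedding over every frame.
        spk = spk.unsqueeze(1).expand(-1, ling.size(1), -1)
        h = self.proj(torch.cat([ling, spk], dim=-1)).transpose(1, 2)
        return self.upsample(h).squeeze(1)   # (B, T * hop) waveform


if __name__ == "__main__":
    frames = torch.randn(2, 25, 3, 64, 64)   # one second of 25 fps lip video
    face = torch.randn(2, 3, 112, 112)
    wav = WaveformGenerator()(LipReader()(frames), SpeakerEncoder()(face))
    print(wav.shape)                          # torch.Size([2, 5000])
```

Running the script end to end yields a (batch, samples) waveform tensor, mirroring the abstract's point that the two conditioning streams are controlled independently: swapping the face image changes only the speaker embedding, while the lip-derived linguistic features stay fixed.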
Related papers
- We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings [47.2515056854372]
In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech.
We propose a novel speaker embedding network that utilizes multiple class centers in speaker-classification training, rather than the single class center used by traditional embeddings.
arXiv Detail & Related papers (2024-07-05T06:54:24Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis [5.824018496599849]
We propose a novel method for modeling numerous speakers.
It expresses the overall characteristics of each speaker in as much detail as a trained multi-speaker model.
arXiv Detail & Related papers (2023-11-20T13:13:24Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Zero-shot personalized lip-to-speech synthesis with face image based voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of a self-supervised learning (SSL) model, trained with a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling linguistic content from speaker identity (a minimal sketch follows this list).
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
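As a rough illustration of the two-stream split mentioned in the last entry above (shared low-level features followed by separate content and speaker streams), a minimal sketch is given below. The class name, layer choices, and dimensions are assumptions for illustration only; the actual method additionally relies on a cross-modal face-audio synchrony objective that is omitted here.

```python
import torch
import torch.nn as nn

# Illustrative two-stream encoder: a shared low-level trunk, then separate
# heads for frame-level linguistic content and utterance-level speaker identity.
# All dimensions and layers are assumptions, not taken from the cited paper.

class TwoStreamEncoder(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=128):
        super().__init__()
        # Shared low-level features over a mel-spectrogram of shape (B, n_mels, T).
        self.trunk = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Frame-level stream for linguistic content.
        self.content_head = nn.Conv1d(256, content_dim, kernel_size=1)
        # Utterance-level stream for speaker identity (mean-pooled over time).
        self.speaker_head = nn.Linear(256, speaker_dim)

    def forward(self, mels):
        h = self.trunk(mels)
        content = self.content_head(h).transpose(1, 2)   # (B, T, content_dim)
        speaker = self.speaker_head(h.mean(dim=-1))      # (B, speaker_dim)
        return content, speaker


if __name__ == "__main__":
    mels = torch.randn(4, 80, 120)                 # four short utterances (assumed hop)
    content, speaker = TwoStreamEncoder()(mels)
    print(content.shape, speaker.shape)            # (4, 120, 256) (4, 128)
```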
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.