Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
- URL: http://arxiv.org/abs/2209.00642v1
- Date: Thu, 1 Sep 2022 17:50:29 GMT
- Title: Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
- Authors: Sindhu B Hegde, K R Prajwal, Rudrabha Mukhopadhyay, Vinay P
Namboodiri, C. V. Jawahar
- Abstract summary: We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
- Score: 44.92322575562816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we address the problem of generating speech from silent lip
videos for any speaker in the wild. In stark contrast to previous works, our
method (i) is not restricted to a fixed number of speakers, (ii) does not
explicitly impose constraints on the domain or the vocabulary and (iii) deals
with videos that are recorded in the wild as opposed to within laboratory
settings. The task presents a host of challenges, with the key one being that
many features of the desired target speech, like voice, pitch and linguistic
content, cannot be entirely inferred from the silent face video. In order to
handle these stochastic variations, we propose a new VAE-GAN architecture that
learns to associate the lip and speech sequences amidst the variations. With
the help of multiple powerful discriminators that guide the training process,
our generator learns to synthesize speech sequences in any voice for the lip
movements of any person. Extensive experiments on multiple datasets show that
we outperform all baselines by a large margin. Further, our network can be
fine-tuned on videos of specific identities to achieve a performance comparable
to single-speaker models that are trained on $4\times$ more data. We conduct
numerous ablation studies to analyze the effect of different modules of our
architecture. We also provide a demo video that demonstrates several
qualitative results along with the code and trained models on our website:
\url{http://cvit.iiit.ac.in/research/projects/cvit-projects/lip-to-speech-synthesis}
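As a rough illustration of the kind of pipeline the abstract describes, the sketch below shows a toy VAE-GAN setup in PyTorch: a lip-frame encoder predicts a Gaussian latent sequence, a decoder maps sampled latents to mel-spectrogram frames, and a single simplified discriminator stands in for the paper's multiple discriminators. All module names, layer sizes, and loss weights here are assumptions for illustration, not the authors' implementation.
```python
# Minimal sketch (not the authors' code): a VAE-GAN style lip-to-speech model
# in the spirit of the abstract. Shapes and architectures are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipEncoder(nn.Module):
    """Encodes a lip-frame sequence (B, T, C, H, W) into per-step mu/logvar."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                               # per-frame 2D CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, 256, batch_first=True)            # temporal context
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, frames):
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)       # (B*T, 64)
        feats, _ = self.rnn(feats.view(b, t, -1))               # (B, T, 256)
        return self.to_mu(feats), self.to_logvar(feats)

class SpeechDecoder(nn.Module):
    """Maps a latent sequence (B, T, D) to mel-spectrogram frames (B, T, 80)."""
    def __init__(self, latent_dim=128, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_mels))

    def forward(self, z):
        return self.net(z)

class MelDiscriminator(nn.Module):
    """Scores whether a mel sequence looks like real speech. The paper uses
    several discriminators; only one simplified critic is shown here."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 5, padding=2),
        )

    def forward(self, mel):                                     # mel: (B, T, n_mels)
        return self.net(mel.transpose(1, 2)).mean(dim=(1, 2))

def vae_gan_losses(frames, real_mel, enc, dec, disc):
    """Illustrative loss computation: reconstruction + KL + adversarial terms."""
    mu, logvar = enc(frames)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterize
    fake_mel = dec(z)
    recon = F.l1_loss(fake_mel, real_mel)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    adv_g = F.softplus(-disc(fake_mel)).mean()                  # generator fools D
    adv_d = (F.softplus(-disc(real_mel)).mean()
             + F.softplus(disc(fake_mel.detach())).mean())      # D separates real/fake
    return recon + kl + adv_g, adv_d

if __name__ == "__main__":
    enc, dec, disc = LipEncoder(), SpeechDecoder(), MelDiscriminator()
    frames = torch.randn(2, 25, 3, 48, 48)                      # 2 clips, 25 lip frames
    real_mel = torch.randn(2, 25, 80)                           # aligned mel frames (toy data)
    g_loss, d_loss = vae_gan_losses(frames, real_mel, enc, dec, disc)
    print(g_loss.item(), d_loss.item())
```
In a real training loop the generator (encoder plus decoder) and the discriminators would be updated alternately with separate optimizers, and additional discriminators (e.g. for audio-visual synchronization) would add further loss terms.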
Related papers
- Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
We propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation.
We present a decoder-only multimodal model for this task, which we call Visatronic.
It embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video only a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Let There Be Sound: Reconstructing High Quality Speech from Silent Videos [34.306490673301184]
The goal of this work is to reconstruct high quality speech from lip motions alone.
A key challenge of lip-to-speech systems is the one-to-many mapping.
We propose a novel lip-to-speech system that significantly improves the generation quality.
arXiv Detail & Related papers (2023-08-29T12:30:53Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the Grid and Lip2Wav Chemistry lecture dataset to evaluate single speaker natural speech generation tasks.
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach that synthesizes a talking-person video of arbitrary length from two inputs: an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.