Towards Accurate Lip-to-Speech Synthesis in-the-Wild
- URL: http://arxiv.org/abs/2403.01087v1
- Date: Sat, 2 Mar 2024 04:07:24 GMT
- Title: Towards Accurate Lip-to-Speech Synthesis in-the-Wild
- Authors: Sindhu Hegde, Rudrabha Mukhopadhyay, C.V. Jawahar, Vinay Namboodiri
- Abstract summary: We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
- Score: 31.289366690147556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel approach to address the task of
synthesizing speech from silent videos of any in-the-wild speaker solely based
on lip movements. The traditional approach of directly generating speech from
lip videos faces the challenge of not being able to learn a robust language
model from speech alone, resulting in unsatisfactory outcomes. To overcome this
issue, we propose incorporating noisy text supervision using a state-of-the-art
lip-to-text network that instills language information into our model. The
noisy text is generated using a pre-trained lip-to-text model, enabling our
approach to work without text annotations during inference. We design a visual
text-to-speech network that utilizes the visual stream to generate accurate
speech, which is in-sync with the silent input video. We perform extensive
experiments and ablation studies, demonstrating our approach's superiority over
the current state-of-the-art methods on various benchmark datasets. Further, we
demonstrate an essential practical application of our method in assistive
technology by generating speech for an ALS patient who has lost their voice but
can still make mouth movements. Our demo video, code, and additional details can be
found at
\url{http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw}.
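Below is a minimal, hypothetical sketch of the two-stage pipeline the abstract describes: a pretrained lip-to-text model supplies noisy transcripts, and a visual text-to-speech network fuses those transcripts with lip-video features to decode video-synchronized mel-spectrograms. The module names, feature dimensions, and toy transformer fusion are illustrative assumptions, not the authors' actual implementation.
```python
# Hedged sketch of noisy-text-supervised lip-to-speech (illustrative only).
import torch
import torch.nn as nn


class VisualTTS(nn.Module):
    """Toy visual text-to-speech model: fuses noisy-text tokens with lip-video
    features and predicts a mel-spectrogram aligned with the silent video."""

    def __init__(self, vocab_size=256, d_model=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(512, d_model)  # 512-dim lip features (assumed)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_text_ids, video_feats):
        # Concatenate text tokens and per-frame video features along time and
        # attend over both, so language content and lip timing interact.
        t = self.text_emb(noisy_text_ids)          # (B, T_text, d)
        v = self.video_proj(video_feats)           # (B, T_frames, d)
        fused = self.fusion(torch.cat([t, v], dim=1))
        # Predict mels only at the video-aligned positions, so the output
        # length matches (and stays in sync with) the number of video frames.
        return self.mel_head(fused[:, t.size(1):, :])  # (B, T_frames, n_mels)


def lip_to_speech(video_feats, lip_to_text_model, tts_model):
    """Inference: a pretrained lip-to-text model provides noisy transcripts,
    so no ground-truth text annotations are needed at test time."""
    noisy_text_ids = lip_to_text_model(video_feats)  # (B, T_text) token ids
    mel = tts_model(noisy_text_ids, video_feats)     # mel in sync with the video
    return mel  # a separate vocoder would convert mel -> waveform


if __name__ == "__main__":
    frames = torch.randn(1, 75, 512)  # ~3 s of lip features at 25 fps (assumed)
    fake_lip2text = lambda v: torch.randint(0, 256, (1, 20))  # stand-in lip-to-text
    print(lip_to_speech(frames, fake_lip2text, VisualTTS()).shape)  # (1, 75, 80)
```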
Related papers
- Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
We propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation.
We present a decoder-only multimodal model for this task, which we call Visatronic.
It embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Visual Speech Enhancement Without A Real Visual Stream [37.88869937166955]
Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises.
Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods.
We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis.
arXiv Detail & Related papers (2020-12-20T06:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.