Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
- URL: http://arxiv.org/abs/2005.08209v1
- Date: Sun, 17 May 2020 10:29:19 GMT
- Title: Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
- Authors: K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
- Abstract summary: We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
- Score: 37.37319356008348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip sequences to speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative, qualitative metrics and human evaluation shows
that our method is four times more intelligible than previous works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
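As a concrete illustration of the task described above, the sketch below maps a silent mouth-crop video to mel-spectrogram frames using a small spatio-temporal encoder and a recurrent decoder (PyTorch). All layer choices, dimensions, and the video-to-mel frame ratio are illustrative assumptions, not the architecture released with this paper; the goal is only to show the shape of the lip-to-speech mapping.

# Minimal illustrative lip-to-speech sketch (PyTorch); shapes and sizes are
# assumptions, not the architecture released with this paper.
import torch
import torch.nn as nn

class LipToMel(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Spatio-temporal encoder over a grayscale mouth crop: (B, 1, T, H, W)
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space
        )
        # Sequence model maps per-frame visual features to mel frames.
        self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        # Each video frame is upsampled to several mel frames (assumed ratio of 4).
        self.frames_per_step = 4
        self.proj = nn.Linear(2 * hidden, n_mels * self.frames_per_step)

    def forward(self, video):           # video: (B, 1, T, H, W)
        feats = self.encoder(video)     # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        out, _ = self.rnn(feats)        # (B, T, 2*hidden)
        mel = self.proj(out)            # (B, T, n_mels * frames_per_step)
        B, T, _ = mel.shape
        return mel.view(B, T * self.frames_per_step, -1)  # (B, T*4, n_mels)

# Toy usage: a 1-second clip at 25 fps with 48x96 mouth crops -> 100 mel frames.
model = LipToMel()
video = torch.randn(2, 1, 25, 48, 96)
mel = model(video)
print(mel.shape)  # torch.Size([2, 100, 80])

In practice, such a model would be trained by regressing the predicted mel frames against ground-truth spectrograms (e.g., with an L1 loss), with a separate vocoder turning the predicted mels into a waveform.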
Related papers
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well from a few facial images, or even a single image, using shallow networks.
The fine-grained dynamic features associated with the speech content expressed by the talking face, however, require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Let There Be Sound: Reconstructing High Quality Speech from Silent Videos [34.306490673301184]
The goal of this work is to reconstruct high quality speech from lip motions alone.
A key challenge of lip-to-speech systems is the one-to-many mapping: the same lip movements can correspond to many possible utterances and voices.
We propose a novel lip-to-speech system that significantly improves the generation quality.
arXiv Detail & Related papers (2023-08-29T12:30:53Z)
- Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation [58.72068260933836]
We propose the Context-Aware LipSync framework (CALS).
CALS is comprised of an Audio-to-Lip map module and a Lip-to-Face module.
arXiv Detail & Related papers (2023-05-31T04:50:32Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lip region given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- Show Me Your Face, And I'll Tell You How You Speak [0.0]
We explore the task of lip to speech synthesis, i.e., learning to generate speech given only the lip movements of a speaker.
We present a novel method "Lip2Speech", with key design choices to achieve accurate lip to speech synthesis in unconstrained scenarios.
arXiv Detail & Related papers (2022-06-28T13:52:47Z)
- Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements (a minimal sketch of this idea appears after the list below).
We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate the single-speaker natural speech generation task.
arXiv Detail & Related papers (2022-06-04T19:40:02Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
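The "Learning Speaker-specific Lip-to-Speech Generation" entry above mentions deep metric learning for temporal synchronization. The sketch below is a rough, SyncNet-style illustration of that idea, not that paper's implementation: matched audio/video window embeddings are scored with cosine similarity, and a binary cross-entropy loss separates in-sync from temporally shifted pairs. The embedding networks are omitted and the temperature is an assumed constant.

# Illustrative SyncNet-style metric-learning loss for audio-visual sync.
# The embedding networks are placeholders; only the loss structure matters here.
import torch
import torch.nn.functional as F

def sync_loss(video_emb, audio_emb, labels):
    """video_emb, audio_emb: (B, D) embeddings of short co-occurring windows.
    labels: (B,) with 1.0 for in-sync pairs and 0.0 for shifted pairs."""
    sim = F.cosine_similarity(video_emb, audio_emb)   # (B,) values in [-1, 1]
    logits = sim * 5.0                                 # assumed temperature scaling
    return F.binary_cross_entropy_with_logits(logits, labels)

# Toy usage with random embeddings: first half in-sync, second half off-sync.
B, D = 8, 128
v = torch.randn(B, D)
a = torch.randn(B, D)
y = torch.tensor([1.0] * 4 + [0.0] * 4)
print(sync_loss(v, a, y))

A score of this kind can serve as an auxiliary loss on the generated speech so that the decoder stays temporally aligned with the input lip movements.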
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.