Learning Speaker-specific Lip-to-Speech Generation
- URL: http://arxiv.org/abs/2206.02050v1
- Date: Sat, 4 Jun 2022 19:40:02 GMT
- Title: Learning Speaker-specific Lip-to-Speech Generation
- Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M Hegde
- Abstract summary: This work aims to understand the correlation/mapping between speech and the sequence of lip movement of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We have trained our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation tasks.
- Score: 28.620557933595585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding lip movement and inferring speech from it is
notoriously difficult for the average person. Accurate lip-reading
benefits from various cues about the speaker and their contextual or
environmental setting. Every speaker has a distinct accent and speaking
style, which can be inferred from their visual and speech features. This
work aims to learn the correlation/mapping between speech and the lip
movement sequences of individual speakers in an unconstrained,
large-vocabulary setting. We model the frame sequence as a prior to the
transformer in an auto-encoder setting and learn a joint embedding that
exploits the temporal properties of both audio and video. We learn
temporal synchronization using deep metric learning, which guides the
decoder to generate speech in sync with the input lip movements. The
predictive posterior thus gives us the generated speech in the speaker's
speaking style. We train our model on the GRID and Lip2Wav Chemistry
lecture datasets to evaluate single-speaker natural speech generation
from lip movement in an unconstrained natural setting. Extensive
evaluation using various qualitative and quantitative metrics, together
with human evaluation, shows that our method outperforms prior work on
the Lip2Wav Chemistry dataset (a large vocabulary in an unconstrained
setting) by a good margin across almost all evaluation metrics and
marginally outperforms the state of the art on the GRID dataset.
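The listing includes no reference code; purely as a hedged illustration of the deep-metric-learning synchronization idea described in the abstract, the sketch below projects per-frame video and audio features into a joint embedding space and applies a triplet-style loss that favors time-aligned pairs over temporally shifted ones. All module names, dimensions, the margin, and the negative-sampling shift are assumptions made for illustration, not the authors' implementation.
```python
# Hypothetical sketch of a temporal-synchronization objective in the
# spirit of the abstract: a joint audio-video embedding trained with a
# deep metric-learning (triplet-style) loss. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedding(nn.Module):
    """Projects per-frame video and audio features into a shared space."""

    def __init__(self, video_dim=512, audio_dim=80, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T, video_dim), audio_feats: (B, T, audio_dim)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        return v, a


def sync_metric_loss(v, a, margin=0.2, shift=2):
    """Triplet-style sync loss: the video embedding at time t should lie
    closer to the audio embedding at t than to one `shift` steps away."""
    pos = (v - a).pow(2).sum(-1)                         # aligned, (B, T)
    neg = (v[:, :-shift] - a[:, shift:]).pow(2).sum(-1)  # misaligned
    return F.relu(pos[:, :-shift] - neg + margin).mean()


if __name__ == "__main__":
    B, T = 4, 75  # e.g. 3 s of video at 25 fps (illustrative values)
    model = JointEmbedding()
    v, a = model(torch.randn(B, T, 512), torch.randn(B, T, 80))
    loss = sync_metric_loss(v, a)
    loss.backward()
    print(f"sync loss: {loss.item():.4f}")
```
In the full model described in the abstract, a synchronization term of this form would be combined with the auto-encoder's reconstruction objective, so the decoder is guided to emit speech that stays in sync with the input lip frames.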
Related papers
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well from a few facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with the speech content expressed by a talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lip region given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- Show Me Your Face, And I'll Tell You How You Speak [0.0]
We explore the task of lip to speech synthesis, i.e., learning to generate speech given only the lip movements of a speaker.
We present a novel method "Lip2Speech", with key design choices to achieve accurate lip to speech synthesis in unconstrained scenarios.
arXiv Detail & Related papers (2022-06-28T13:52:47Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
arXiv Detail & Related papers (2020-05-17T10:29:19Z)