Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
- URL: http://arxiv.org/abs/2308.15256v2
- Date: Thu, 4 Jan 2024 11:10:57 GMT
- Title: Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
- Authors: Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
- Abstract summary: The goal of this work is to reconstruct high quality speech from lip motions alone.
A key challenge of lip-to-speech systems is the one-to-many mapping.
We propose a novel lip-to-speech system that significantly improves the generation quality.
- Score: 34.306490673301184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to reconstruct high quality speech from lip motions
alone, a task also known as lip-to-speech. A key challenge of lip-to-speech
systems is the one-to-many mapping caused by (1) the existence of homophenes
and (2) multiple speech variations, resulting in mispronounced and
over-smoothed speech. In this paper, we propose a novel lip-to-speech system
that significantly improves the generation quality by alleviating the
one-to-many mapping problem from multiple perspectives. Specifically, we
incorporate (1) self-supervised speech representations to disambiguate
homophenes, and (2) acoustic variance information to model diverse speech
styles. Additionally, to better solve the aforementioned problem, we employ a
flow-based post-net which captures and refines the details of the generated
speech. We perform extensive experiments on two datasets, and demonstrate that
our method achieves generation quality close to that of real human
utterances, outperforming existing methods in terms of speech naturalness and
intelligibility by a large margin. Synthesised samples are available at our
demo page: https://mm.kaist.ac.kr/projects/LTBS.
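To make the pipeline described above more concrete, here is a minimal, hypothetical PyTorch sketch of a lip-to-speech model combining the three ideas the abstract names: a content head trained against self-supervised (SSL) speech units to disambiguate homophenes, a variance adaptor that injects pitch/energy information, and a post-net that refines the coarse spectrogram. This is not the authors' code; module names, layer sizes, and the simple convolutional refiner standing in for the flow-based post-net are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class LipToSpeech(nn.Module):
    """Illustrative lip-to-speech skeleton (not the authors' architecture)."""

    def __init__(self, visual_dim=512, d_model=256, n_mels=80, n_ssl_units=200):
        super().__init__()
        # Lip encoder: per-frame visual features -> contextual features.
        self.video_encoder = nn.Sequential(
            nn.Conv1d(visual_dim, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        # Content head: predict discrete self-supervised speech units
        # (e.g. clustered SSL features) to disambiguate homophenes.
        self.ssl_head = nn.Linear(d_model, n_ssl_units)
        # Variance adaptor: predict pitch/energy and feed them back in.
        self.pitch_predictor = nn.Linear(d_model, 1)
        self.energy_predictor = nn.Linear(d_model, 1)
        self.variance_proj = nn.Linear(2, d_model)
        # Coarse mel decoder plus a residual refiner standing in for the
        # flow-based post-net described in the abstract.
        self.mel_decoder = nn.Linear(d_model, n_mels)
        self.postnet = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv1d(d_model, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, time, visual_dim) pre-extracted lip features.
        h = self.video_encoder(visual_feats.transpose(1, 2)).transpose(1, 2)
        ssl_logits = self.ssl_head(h)                      # content targets
        pitch = self.pitch_predictor(h)
        energy = self.energy_predictor(h)
        h = h + self.variance_proj(torch.cat([pitch, energy], dim=-1))
        mel_coarse = self.mel_decoder(h)
        mel_fine = mel_coarse + self.postnet(
            mel_coarse.transpose(1, 2)).transpose(1, 2)
        return mel_fine, ssl_logits, pitch, energy


# Shape check on random inputs: 2 clips, 75 video frames each.
model = LipToSpeech()
mel, units, pitch, energy = model(torch.randn(2, 75, 512))
print(mel.shape, units.shape)  # torch.Size([2, 75, 80]) torch.Size([2, 75, 200])
```

In a real system the SSL-unit logits, variance predictions, and refined mel would each carry their own loss term; the forward pass here only shows how the pieces connect.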
Related papers
- Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach [3.89476785897726]
We introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features.
Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted.
arXiv Detail & Related papers (2024-06-02T23:51:43Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on just a few minutes of video while achieving state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors (see the residual vector quantization sketch after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models could extract meaningful features of a wide range of non-speech audio, while they may also fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild [44.92322575562816]
We propose a VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations.
Our generator learns to synthesize speech in any voice for the lip sequences of any person.
We conduct numerous ablation studies to analyze the effect of different modules of our architecture.
arXiv Detail & Related papers (2022-09-01T17:50:29Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We propose a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices (see the mask-predict sketch after this list).
TranSpeech shows a significant improvement in inference latency, with speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
arXiv Detail & Related papers (2020-05-17T10:29:19Z)
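As referenced in the NaturalSpeech 2 entry above, the following is a minimal, hypothetical sketch of residual vector quantization (RVQ), the mechanism behind the quantized latent vectors mentioned there: each stage quantizes the residual left by the previous stage. Codebook sizes, dimensions, and the absence of codebook-learning losses are simplifying assumptions; this is not the NaturalSpeech 2 implementation.

```python
import torch
import torch.nn as nn


class ResidualVectorQuantizer(nn.Module):
    """Illustrative RVQ: each stage quantizes the residual of the previous one."""

    def __init__(self, dim=128, codebook_size=1024, n_quantizers=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(n_quantizers)]
        )

    def forward(self, x):
        # x: (batch, time, dim) continuous latents from a codec encoder.
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for cb in self.codebooks:
            # Nearest codebook entry (squared L2) for the current residual.
            dists = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(dim=-1)
            idx = dists.argmin(dim=-1)            # (batch, time)
            chosen = cb(idx)                      # (batch, time, dim)
            quantized = quantized + chosen
            residual = residual - chosen          # next stage quantizes what is left
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)


rvq = ResidualVectorQuantizer()
z = torch.randn(2, 50, 128)                       # fake encoder latents
z_q, codes = rvq(z)
print(z_q.shape, codes.shape)  # torch.Size([2, 50, 128]) torch.Size([2, 50, 8])
```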
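As referenced in the TranSpeech entry above, this is a hypothetical sketch of the iterative mask-predict loop behind non-autoregressive unit generation: all positions start masked, the least confident predictions are re-masked each round, and the number of masked positions shrinks on a linear schedule. The toy model, vocabulary size, and schedule are assumptions, not TranSpeech's actual decoder.

```python
import torch


@torch.no_grad()
def mask_predict(model, length, mask_id, iterations=4):
    """Illustrative mask-predict decoding for a non-autoregressive unit generator."""
    # Start from a fully masked target sequence of the desired length.
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    scores = torch.zeros(1, length)
    for step in range(iterations):
        logits = model(tokens)                        # (1, length, vocab)
        probs = logits.softmax(dim=-1)
        new_scores, new_tokens = probs.max(dim=-1)
        # Fill only the positions that are currently masked.
        masked = tokens.eq(mask_id)
        tokens = torch.where(masked, new_tokens, tokens)
        scores = torch.where(masked, new_scores, scores)
        # Re-mask the least confident positions; fewer every iteration.
        n_mask = int(length * (1.0 - (step + 1) / iterations))
        if n_mask == 0:
            break
        low_conf = scores.topk(n_mask, dim=-1, largest=False).indices
        tokens.scatter_(1, low_conf, mask_id)
    return tokens


# Toy stand-in model: random logits over 1000 speech units, id 1000 = mask.
toy_model = lambda tok: torch.randn(tok.size(0), tok.size(1), 1000)
print(mask_predict(toy_model, length=20, mask_id=1000))
```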
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.