RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting
Self-Supervised Representations
- URL: http://arxiv.org/abs/2307.01233v1
- Date: Mon, 3 Jul 2023 09:13:57 GMT
- Title: RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting
Self-Supervised Representations
- Authors: Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi
- Abstract summary: We propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
A non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content.
A vocoder then converts the speech features into raw waveforms.
- Score: 13.995231731152462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Significant progress has been made in speaker-dependent Lip-to-Speech
synthesis, which aims to generate speech from silent videos of talking faces.
Current state-of-the-art approaches primarily employ non-autoregressive
sequence-to-sequence architectures to directly predict mel-spectrograms or
audio waveforms from lip representations. We hypothesize that the direct
mel-prediction hampers training/model efficiency due to the entanglement of
speech content with ambient information and speaker characteristics. To this
end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
First, a non-autoregressive sequence-to-sequence model maps self-supervised
visual features to a representation of disentangled speech content. A vocoder
then converts the speech features into raw waveforms. Extensive evaluations
confirm the effectiveness of our setup, achieving state-of-the-art performance
on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT
datasets. Speech samples from RobustL2S can be found at
https://neha-sherin.github.io/RobustL2S/
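The abstract describes a two-stage, modular pipeline: a non-autoregressive sequence-to-sequence model first maps self-supervised visual features to a disentangled speech-content representation, and a vocoder then renders that representation as a waveform. The sketch below illustrates only this decomposition; the module names, dimensions, the Transformer encoder, and the use of discrete content units are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a modular lip-to-speech pipeline (not the authors' code).
# Stage 1: non-autoregressive seq2seq from self-supervised visual features
#          to discrete speech-content units.
# Stage 2: a separately trained vocoder turns the content units into a waveform.
import torch
import torch.nn as nn


class VisualToContent(nn.Module):
    """Non-autoregressive mapper: visual SSL features -> speech-content unit logits."""

    def __init__(self, visual_dim=768, hidden_dim=512, num_units=200, num_layers=6):
        super().__init__()
        self.proj = nn.Linear(visual_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_units = nn.Linear(hidden_dim, num_units)  # classify each frame into a unit

    def forward(self, visual_feats):           # (batch, frames, visual_dim)
        h = self.encoder(self.proj(visual_feats))
        return self.to_units(h)                # (batch, frames, num_units) logits


class UnitVocoder(nn.Module):
    """Placeholder vocoder: content units -> waveform (stand-in for e.g. a HiFi-GAN)."""

    def __init__(self, num_units=200, hidden_dim=256, upsample=320):
        super().__init__()
        self.embed = nn.Embedding(num_units, hidden_dim)
        self.upsample = nn.ConvTranspose1d(hidden_dim, 1, kernel_size=upsample, stride=upsample)

    def forward(self, units):                  # (batch, frames) integer unit ids
        h = self.embed(units).transpose(1, 2)  # (batch, hidden_dim, frames)
        return self.upsample(h).squeeze(1)     # (batch, frames * upsample) waveform


if __name__ == "__main__":
    video_feats = torch.randn(1, 50, 768)           # 50 video frames of SSL features
    content_logits = VisualToContent()(video_feats)
    units = content_logits.argmax(dim=-1)           # most likely content unit per frame
    waveform = UnitVocoder()(units)
    print(waveform.shape)                           # torch.Size([1, 16000])
```

In a full system the content targets would come from a pre-trained speech SSL model and the vocoder would be trained on the target speaker; the point of the decomposition, per the abstract's hypothesis, is to keep speaker characteristics and ambient information out of the first stage.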
Related papers
- PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling [25.966328901566815]
We propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
arXiv Detail & Related papers (2023-06-13T01:36:55Z) - Intelligible Lip-to-Speech Synthesis with Speech Units [32.65865343643458]
We propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video.
We introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units.
arXiv Detail & Related papers (2023-05-31T07:17:32Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Learning Speaker-specific Lip-to-Speech Generation [28.620557933595585]
This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers.
We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with input lip movements.
We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation.
arXiv Detail & Related papers (2022-06-04T19:40:02Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (a minimal sketch of such a mask-predict loop appears after this list).
TranSpeech significantly improves inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction
and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (a toy sketch of this target swapping appears after this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic
Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
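The TranSpeech entry above mentions a non-autoregressive technique that repeatedly masks and predicts unit choices. Below is a minimal, generic mask-predict decoding loop over discrete units; the model interface, the confidence-based re-masking, and the linear unmasking schedule are assumptions for illustration and do not reproduce the paper's exact procedure.

```python
# Generic mask-predict decoding over discrete units (illustrative, not TranSpeech's code).
# Each pass fills the masked positions, then re-masks the least confident ones.
import torch


def mask_predict(model, length, mask_id, iterations=4):
    units = torch.full((1, length), mask_id, dtype=torch.long)   # start fully masked
    scores = torch.zeros(1, length)
    for t in range(iterations):
        probs = model(units).softmax(dim=-1)                     # (1, length, num_units)
        new_scores, new_units = probs.max(dim=-1)
        masked = units.eq(mask_id)
        units = torch.where(masked, new_units, units)            # fill only masked slots
        scores = torch.where(masked, new_scores, scores)
        num_to_mask = int(length * (1 - (t + 1) / iterations))   # linear unmasking schedule
        if num_to_mask == 0:
            break
        worst = scores.topk(num_to_mask, largest=False).indices  # least confident positions
        units[0, worst[0]] = mask_id                             # re-mask them for the next pass
    return units


if __name__ == "__main__":
    toy_model = lambda x: torch.randn(x.shape[0], x.shape[1], 101)  # stand-in unit predictor
    print(mask_predict(toy_model, length=20, mask_id=100).shape)    # torch.Size([1, 20])
```

Each iteration is one parallel forward pass over the whole sequence, which is where the latency gain over token-by-token autoregressive decoding comes from.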
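The Wav2vec-Switch entry describes feeding clean and noise-added versions of the same utterance through the network and swapping their quantized representations as extra prediction targets, so the contextual features of one view must also predict the targets of the other. The toy loss below shows only that target-swapping idea with a frame-wise InfoNCE objective; it is a hedged sketch, not the wav2vec 2.0 loss (which, among other things, samples negatives and adds a codebook diversity term).

```python
# Toy illustration of swapped contrastive targets (not the wav2vec 2.0 implementation).
import torch
import torch.nn.functional as F


def swapped_contrastive_loss(ctx_clean, ctx_noisy, q_clean, q_noisy, temperature=0.1):
    """ctx_*: contextual features, q_*: quantized targets; all (frames, dim)."""

    def info_nce(context, targets):
        # similarity of every context frame to every target frame; the frame at the
        # same index is the positive, all other frames act as negatives
        logits = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
        labels = torch.arange(context.size(0))
        return F.cross_entropy(logits / temperature, labels)

    standard = info_nce(ctx_clean, q_clean) + info_nce(ctx_noisy, q_noisy)
    switched = info_nce(ctx_clean, q_noisy) + info_nce(ctx_noisy, q_clean)  # swapped targets
    return standard + switched


if __name__ == "__main__":
    frames, dim = 40, 256
    clean = torch.randn(frames, dim)
    noisy = clean + 0.1 * torch.randn(frames, dim)   # stand-in for a noise-augmented view
    print(swapped_contrastive_loss(clean, noisy, clean.detach(), noisy.detach()))
```

The switched terms force the contextual representations of the clean and noisy inputs to agree, which is how noise robustness gets encoded into the learned features.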