Intelligible Lip-to-Speech Synthesis with Speech Units
- URL: http://arxiv.org/abs/2305.19603v1
- Date: Wed, 31 May 2023 07:17:32 GMT
- Title: Intelligible Lip-to-Speech Synthesis with Speech Units
- Authors: Jeongsoo Choi, Minsu Kim, Yong Man Ro
- Abstract summary: We propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video.
We introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units.
- Score: 32.65865343643458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework
for synthesizing intelligible speech from a silent lip movement video.
Specifically, to complement the insufficient supervisory signal of the previous
L2S model, we propose to use quantized self-supervised speech representations,
named speech units, as an additional prediction target for the L2S model.
Therefore, the proposed L2S model is trained to generate multiple targets:
mel-spectrogram and speech units. As the speech units are discrete while the
mel-spectrogram is continuous, the proposed multi-target L2S model can be
trained with strong content supervision, without using text-labeled data.
Moreover, to accurately convert the synthesized mel-spectrogram into a
waveform, we introduce a multi-input vocoder that can generate a clear waveform
even from a blurry and noisy mel-spectrogram by referring to the speech units.
Extensive experimental results confirm the effectiveness of the proposed method
in L2S.
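The multi-target training described above amounts to a weighted sum of a continuous mel-spectrogram reconstruction loss and a discrete speech-unit classification loss. Below is a minimal PyTorch sketch of that objective; the head names, tensor shapes, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a multi-target L2S objective: continuous mel regression
# plus discrete speech-unit classification. Shapes and weights are assumed.
import torch
import torch.nn.functional as F

B, T, N_MELS, N_UNITS = 2, 50, 80, 200   # batch, frames, mel bins, unit vocabulary

# Hypothetical model outputs: a continuous mel head and a discrete unit head.
pred_mel    = torch.randn(B, T, N_MELS)            # regression target
unit_logits = torch.randn(B, T, N_UNITS)           # classification target

# Supervision: ground-truth mel frames and quantized SSL "speech units"
# (e.g., cluster indices of self-supervised speech features).
gt_mel   = torch.randn(B, T, N_MELS)
gt_units = torch.randint(0, N_UNITS, (B, T))

mel_loss  = F.l1_loss(pred_mel, gt_mel)                          # continuous loss
unit_loss = F.cross_entropy(unit_logits.transpose(1, 2), gt_units)  # discrete loss

lambda_unit = 1.0                                  # assumed weighting
total_loss = mel_loss + lambda_unit * unit_loss
```

At inference time, the multi-input vocoder would consume both the predicted mel-spectrogram and the predicted units, using the discrete units to disambiguate content where the mel-spectrogram is blurry or noisy.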
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
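The summary above describes a generic encode-then-vocode pipeline: an SSL encoder (HuBERT) turns the surrounding audio context into frame-level features, and a neural vocoder (HiFi-GAN) maps features back to a waveform. The toy sketch below uses stand-in modules instead of the real pretrained networks and only illustrates that data flow.

```python
# Toy stand-ins for a HuBERT-like encoder and a HiFi-GAN-like vocoder,
# illustrating the inpainting data flow (not the paper's actual models).
import torch
import torch.nn as nn

class ToySSLEncoder(nn.Module):          # stand-in for a HuBERT-like encoder
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Conv1d(1, dim, kernel_size=320, stride=320)  # ~20 ms frames
    def forward(self, wav):              # wav: (B, samples)
        return self.proj(wav.unsqueeze(1)).transpose(1, 2)          # (B, frames, dim)

class ToyVocoder(nn.Module):             # stand-in for a HiFi-GAN-like decoder
    def __init__(self, dim=768):
        super().__init__()
        self.deproj = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)
    def forward(self, feats):            # feats: (B, frames, dim)
        return self.deproj(feats.transpose(1, 2)).squeeze(1)        # (B, samples)

wav = torch.randn(1, 16000)              # 1 s of audio at 16 kHz
wav_masked = wav.clone()
wav_masked[:, 6000:10000] = 0.0          # simulate the missing segment

encoder, vocoder = ToySSLEncoder(), ToyVocoder()
feats = encoder(wav_masked)              # SSL features of the surrounding context
# In the paper's setting, the SSL model's context fills in features for the
# gap; the vocoder then resynthesizes a complete waveform from them.
reconstructed = vocoder(feats)
```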
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
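De-duplication here means collapsing runs of identical consecutive unit indices, which alone can shorten a unit sequence substantially; subword modeling (e.g., BPE over the de-duplicated units) compresses it further. A minimal Python illustration of the de-duplication step:

```python
# Collapse repeated consecutive discrete units ("de-duplication") and keep
# the run lengths, which some systems use as duration information.
from itertools import groupby

units = [52, 52, 52, 7, 7, 131, 131, 131, 131, 9, 52, 52]

deduplicated = [u for u, _ in groupby(units)]             # one unit per run
durations    = [len(list(g)) for _, g in groupby(units)]  # optional run lengths

print(deduplicated)   # [52, 7, 131, 9, 52]
print(durations)      # [3, 2, 4, 1, 2]
```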
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations [13.995231731152462]
We propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
A non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content.
A vocoder then converts the speech features into raw waveforms.
arXiv Detail & Related papers (2023-07-03T09:13:57Z)
- Lip-to-Speech Synthesis in the Wild with Multi-task Learning [32.65865343643458]
We develop a powerful Lip2Speech method that can reconstruct speech with correct content from the input lip movements, even in in-the-wild environments.
We design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss.
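A hedged sketch of such a multi-task objective is given below: an acoustic reconstruction loss plus a text loss, where a CTC-style objective is assumed purely for illustration (the paper's exact text supervision and loss weighting may differ).

```python
# Illustrative multi-task loss: mel reconstruction + an assumed CTC-style
# text objective. Shapes, vocabulary, and weighting are placeholders.
import torch
import torch.nn.functional as F

B, T, N_MELS, VOCAB = 2, 60, 80, 40      # batch, frames, mel bins, characters (blank = 0)

pred_mel    = torch.randn(B, T, N_MELS)
gt_mel      = torch.randn(B, T, N_MELS)
text_logits = torch.randn(T, B, VOCAB).log_softmax(-1)   # (time, batch, vocab)

targets        = torch.randint(1, VOCAB, (B, 12))         # dummy character IDs
input_lengths  = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

recon_loss = F.l1_loss(pred_mel, gt_mel)
text_loss  = F.ctc_loss(text_logits, targets, input_lengths, target_lengths, blank=0)

total = recon_loss + 0.5 * text_loss     # weighting is an assumption
```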
arXiv Detail & Related papers (2023-02-17T12:31:26Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
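The mask-and-predict decoding mentioned above is, generically, an iterative refinement loop: start from a fully masked unit sequence, predict all positions in parallel, then re-mask the least confident positions and repeat for a fixed number of passes. The sketch below shows a generic CMLM-style loop of this kind, not TranSpeech's exact procedure; `model`, the iteration count, and the masking schedule are assumptions.

```python
# Generic mask-predict (CMLM-style) non-autoregressive decoding over unit IDs.
import torch

def mask_predict(model, src, tgt_len, n_iters=4, mask_id=-1):
    units = torch.full((tgt_len,), mask_id)              # start fully masked
    scores = torch.zeros(tgt_len)
    for it in range(n_iters):
        logits = model(src, units)                       # (tgt_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)
        masked = units == mask_id
        units[masked], scores[masked] = preds[masked], probs[masked]
        # Re-mask the least confident positions before the next pass.
        n_mask = int(tgt_len * (1 - (it + 1) / n_iters))
        if n_mask > 0:
            remask = scores.topk(n_mask, largest=False).indices
            units[remask] = mask_id
    return units

# Toy usage with a random stand-in "model", just to show the loop runs.
toy_model = lambda src, units: torch.randn(units.shape[0], 100)
print(mask_predict(toy_model, src=None, tgt_len=20))
```

Because every pass fills all positions in parallel, the number of network calls is the fixed iteration count rather than the output length, which is where the latency advantage over autoregressive decoding comes from.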
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
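Discrete-unit targets for this kind of S2ST are commonly obtained by clustering self-supervised speech features with k-means and using the cluster indices as the prediction target. The sketch below shows that generic target-preparation step; the feature dimensionality, number of clusters, and clustering setup are illustrative, not this paper's configuration.

```python
# Generic discrete-unit target preparation: k-means over SSL speech features.
import numpy as np
from sklearn.cluster import KMeans

# Pretend these are frame-level SSL features of target-language speech.
feats = np.random.randn(500, 768).astype(np.float32)   # (frames, feature dim)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)
units = kmeans.predict(feats)                           # (frames,) integer unit IDs

# The translation model is then trained sequence-to-sequence:
# source speech (or its features) -> target unit sequence,
# and a unit-based vocoder turns predicted units back into a waveform.
print(units[:20])
```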
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
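The approach above is a two-module pipeline: a bottleneck feature extractor (BNE) that strips speaker identity from the source speech, and a seq2seq synthesis module that regenerates acoustics conditioned on a target-speaker embedding. The toy sketch below only mirrors that data flow; the module designs, dimensions, and speaker table are placeholders, not the paper's architecture.

```python
# Toy two-module any-to-many VC pipeline: content extractor + speaker-
# conditioned synthesizer. All module internals are placeholders.
import torch
import torch.nn as nn

class ToyBNE(nn.Module):                         # content ("bottleneck") extractor
    def __init__(self, n_mels=80, bottleneck=144):
        super().__init__()
        self.net = nn.GRU(n_mels, bottleneck, batch_first=True)
    def forward(self, mel):                      # (B, T, n_mels)
        return self.net(mel)[0]                  # (B, T, bottleneck)

class ToySynth(nn.Module):                       # speaker-conditioned synthesizer
    def __init__(self, bottleneck=144, spk_dim=64, n_mels=80, n_speakers=10):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)
        self.out = nn.GRU(bottleneck + spk_dim, n_mels, batch_first=True)
    def forward(self, content, spk_id):          # content: (B, T, bottleneck)
        spk = self.spk_table(spk_id)[:, None].expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, spk], dim=-1))[0]   # (B, T, n_mels)

mel_src = torch.randn(1, 120, 80)                # speech from any source speaker
content = ToyBNE()(mel_src)
mel_tgt = ToySynth()(content, torch.tensor([3])) # convert to target speaker #3
```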
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.