End-to-End Adversarial Text-to-Speech
- URL: http://arxiv.org/abs/2006.03575v3
- Date: Wed, 17 Mar 2021 11:42:25 GMT
- Title: End-to-End Adversarial Text-to-Speech
- Authors: Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan
- Abstract summary: We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
- Score: 33.01223309795122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern text-to-speech synthesis pipelines typically involve multiple
processing stages, each of which is designed or learnt independently from the
rest. In this work, we take on the challenging task of learning to synthesise
speech from normalised text or phonemes in an end-to-end manner, resulting in
models which operate directly on character or phoneme input sequences and
produce raw speech audio outputs. Our proposed generator is feed-forward and
thus efficient for both training and inference, using a differentiable
alignment scheme based on token length prediction. It learns to produce high
fidelity audio through a combination of adversarial feedback and prediction
losses constraining the generated audio to roughly match the ground truth in
terms of its total duration and mel-spectrogram. To allow the model to capture
temporal variation in the generated audio, we employ soft dynamic time warping
in the spectrogram-based prediction loss. The resulting model achieves a mean
opinion score exceeding 4 on a 5 point scale, which is comparable to the
state-of-the-art models relying on multi-stage training and additional
supervision.
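The alignment idea in the abstract (predict a length for each input token, accumulate those lengths into token centres, and interpolate token features at every output time step so the whole path stays differentiable) can be illustrated with a short sketch. The code below is an assumption-laden PyTorch illustration, not the authors' implementation: the module name, the softplus length activation, the Gaussian-style softmax interpolation, and the one-frame-per-length-unit output rate are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLengthAligner(nn.Module):
    """Upsample per-token features to frame rate via predicted token lengths."""

    def __init__(self, dim: int, temperature: float = 10.0):
        super().__init__()
        self.length_head = nn.Linear(dim, 1)  # one positive length per token
        self.temperature = temperature        # width of the interpolation kernel

    def forward(self, token_feats: torch.Tensor):
        # token_feats: (batch, num_tokens, dim), e.g. character/phoneme encodings
        lengths = F.softplus(self.length_head(token_feats)).squeeze(-1)  # (B, N) > 0
        ends = torch.cumsum(lengths, dim=-1)   # cumulative end position of each token
        centres = ends - 0.5 * lengths         # centre position of each token
        total = ends[:, -1]                    # predicted total duration, shape (B,)

        # One output position per unit of predicted length (illustrative rate).
        num_frames = int(total.max().ceil().item())
        t = torch.arange(num_frames, device=token_feats.device).float()  # (T,)

        # Soft monotonic alignment: each frame attends to tokens with weights
        # given by a softmax over negative squared distance to the token centres,
        # so gradients flow back into the predicted lengths.
        dist2 = (t[None, :, None] - centres[:, None, :]) ** 2            # (B, T, N)
        weights = torch.softmax(-dist2 / self.temperature, dim=-1)

        frame_feats = torch.einsum("btn,bnd->btd", weights, token_feats)  # (B, T, dim)
        return frame_feats, total


if __name__ == "__main__":
    aligner = TokenLengthAligner(dim=64)
    tokens = torch.randn(2, 17, 64)        # e.g. 17 phoneme embeddings per utterance
    frames, total_len = aligner(tokens)
    # `total` can be tied to the ground-truth utterance length with a simple
    # regression loss, matching the "total duration" constraint in the abstract.
    print(frames.shape, total_len.shape)   # torch.Size([2, T, 64]), torch.Size([2])
```

Downstream, the frame-rate features would feed a feed-forward decoder producing raw audio; the point of the sketch is only that every step from length prediction to interpolation is differentiable.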
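The soft dynamic time warping term mentioned in the abstract replaces the hard minimum in the classic DTW recursion with a smooth soft-minimum, so the spectrogram prediction loss tolerates small timing shifts between generated and ground-truth audio while remaining differentiable. Below is a minimal, unoptimised reference sketch of that recursion (the squared-Euclidean frame distance and the gamma smoothing value are assumptions, and real implementations vectorise or batch the O(T1·T2) loop); it is not the paper's code.

```python
import torch


def soft_dtw(pred: torch.Tensor, target: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Soft-DTW discrepancy between two mel-spectrograms.

    pred:   (T1, D) generated mel frames
    target: (T2, D) ground-truth mel frames
    """
    # Pairwise frame distances; squared Euclidean is an illustrative choice.
    dist = torch.cdist(pred, target) ** 2            # (T1, T2)
    T1, T2 = dist.shape
    inf = torch.tensor(float("inf"))

    # prev_row[j] holds R[i-1, j]; rows are Python lists of 0-dim tensors so
    # autograd tracks every cell without in-place tensor writes.
    prev_row = [torch.tensor(0.0)] + [inf] * T2      # R[0, :]
    for i in range(1, T1 + 1):
        row = [inf]                                  # R[i, 0]
        for j in range(1, T2 + 1):
            # Smoothed minimum over the three DTW predecessors.
            prev = torch.stack([prev_row[j], row[j - 1], prev_row[j - 1]])
            softmin = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            row.append(dist[i - 1, j - 1] + softmin)
        prev_row = row
    return prev_row[T2]


if __name__ == "__main__":
    gen_mel = torch.randn(50, 80, requires_grad=True)  # generated spectrogram
    ref_mel = torch.randn(48, 80)                      # ground-truth spectrogram
    loss = soft_dtw(gen_mel, ref_mel)
    loss.backward()                                    # gradients reach gen_mel
    print(float(loss))
```

As gamma approaches zero the soft-minimum approaches the ordinary DTW minimum; a larger gamma gives a smoother, easier-to-optimise loss surface.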
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [77.4527868307914]
We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
arXiv Detail & Related papers (2021-10-14T07:59:27Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when enough input has been read to start synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
- FastPitch: Parallel Text-to-speech with Pitch Prediction [9.213700601337388]
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
arXiv Detail & Related papers (2020-06-11T23:23:58Z)