Differentiable Duration Modeling for End-to-End Text-to-Speech
- URL: http://arxiv.org/abs/2203.11049v1
- Date: Mon, 21 Mar 2022 15:14:44 GMT
- Title: Differentiable Duration Modeling for End-to-End Text-to-Speech
- Authors: Bac Nguyen, Fabien Cardinaux, Stefan Uhlich
- Abstract summary: Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration.
- Score: 6.571447892202893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallel text-to-speech (TTS) models have recently enabled fast and
highly-natural speech synthesis. However, such models typically require
external alignment models, which are not necessarily optimized for the decoder
as they are not jointly trained. In this paper, we propose a differentiable
duration method for learning monotonic alignments between input and output
sequences. Our method is based on a soft-duration mechanism that optimizes a
stochastic process in expectation. Using this differentiable duration method, a
direct text-to-waveform TTS model is introduced to produce raw audio as output
instead of performing neural vocoding. Our model learns to perform
high-fidelity speech synthesis through a combination of adversarial training
and matching the total ground-truth duration. Experimental results show that
our model obtains competitive results while enjoying a much simpler training
pipeline. Audio samples are available online.
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pair and SynthesizedAudio-text pair, we propose a complementary joint training(CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network [41.4599368523939]
We propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model.
Experimental results show that the proposed method requires about ten times less inference time to achieve comparable synthetic speech quality.
arXiv Detail & Related papers (2021-09-22T13:29:10Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- End-to-End Text-to-Speech using Latent Duration based on VQ-VAE [48.151894340550385]
Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS).
We propose a new TTS framework that incorporates duration as a discrete latent variable in TTS.
arXiv Detail & Related papers (2020-10-19T15:34:49Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)