OverFlow: Putting flows on top of neural transducers for better TTS
- URL: http://arxiv.org/abs/2211.06892v2
- Date: Mon, 29 May 2023 14:23:44 GMT
- Title: OverFlow: Putting flows on top of neural transducers for better TTS
- Authors: Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, \'Eva
Sz\'ekely, Gustav Eje Henter
- Abstract summary: Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.
In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics.
- Score: 9.346907121576258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best
features of classic statistical speech synthesis and modern neural TTS,
requiring less data and fewer training updates, and are less prone to gibberish
output caused by neural attention failures. In this paper, we combine neural
HMM TTS with normalising flows for describing the highly non-Gaussian
distribution of speech acoustics. The result is a powerful, fully probabilistic
model of durations and acoustics that can be trained using exact maximum
likelihood. Experiments show that a system based on our proposal needs fewer
updates than comparable methods to produce accurate pronunciations and a
subjective speech quality close to natural speech. Please see
https://shivammehta25.github.io/OverFlow/ for audio examples and code.
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio predictor with residual vectorizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low
Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio model.
Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based
On FullConv-TTS [0.0]
We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units)
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask.
The final experimental results show that the TTS model using only the CNN component can reduce the training time compared to the classic TTS models such as Tacotron.
arXiv Detail & Related papers (2022-10-24T14:18:43Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband
Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN's synthesis efficiency on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - Neural HMMs are all you need (for high-quality attention-free TTS) [13.467456334392594]
We discuss how to combine innovations from both classical and contemporary TTS for best results.
The final system is smaller and simpler than Tacotron 2 and learns to align and speak with fewer iterations.
Unlike Tacotron 2, it also allows easy control over speaking rate.
arXiv Detail & Related papers (2021-08-30T15:38:00Z) - Investigation of learning abilities on linguistic features in
sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.