Flowtron: an Autoregressive Flow-based Generative Network for
Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2005.05957v3
- Date: Thu, 16 Jul 2020 15:10:18 GMT
- Title: Flowtron: an Autoregressive Flow-based Generative Network for
Text-to-Speech Synthesis
- Authors: Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro
- Abstract summary: Flowtron is an autoregressive flow-based generative network for text-to-speech synthesis.
We provide results on control of speech variation, interpolation between samples, and style transfer between speakers seen and unseen during training.
- Score: 23.115879727598262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we propose Flowtron: an autoregressive flow-based generative
network for text-to-speech synthesis with control over speech variation and
style transfer. Flowtron borrows insights from IAF and revamps Tacotron in
order to provide high-quality and expressive mel-spectrogram synthesis.
Flowtron is optimized by maximizing the likelihood of the training data, which
makes training simple and stable. Flowtron learns an invertible mapping of data
to a latent space that can be manipulated to control many aspects of speech
synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores
(MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech
quality. In addition, we provide results on control of speech variation,
interpolation between samples and style transfer between speakers seen and
unseen during training. Code and pre-trained models will be made publicly
available at https://github.com/NVIDIA/flowtron
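
The core mechanism is easy to state in code: each flow step applies an invertible affine transform to a mel frame, with scale and bias predicted autoregressively from previous frames and the text conditioning, and training maximizes the exact likelihood via the change-of-variables log-determinant. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the layer sizes, module names, and single-LSTM structure are placeholders.

```python
import torch
import torch.nn as nn

class AffineARFlowStep(nn.Module):
    """One affine autoregressive flow step over mel frames (illustrative sketch).

    An LSTM consumes the previous mel frame plus aligned text conditioning and
    predicts a per-frame log-scale and bias. The transform is invertible, so the
    model can be trained by exact maximum likelihood (using the log-determinant)
    and sampled by running the inverse pass from Gaussian noise.
    """

    def __init__(self, n_mel=80, n_text=512, hidden=1024):  # placeholder sizes
        super().__init__()
        self.ar = nn.LSTM(n_mel + n_text, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_mel)             # -> (log_s, b) per frame

    def forward(self, mel, text_ctx):
        # mel: (B, T, n_mel); text_ctx: (B, T, n_text) time-aligned conditioning
        prev = torch.cat([torch.zeros_like(mel[:, :1]), mel[:, :-1]], dim=1)
        h, _ = self.ar(torch.cat([prev, text_ctx], dim=-1))
        log_s, b = self.proj(h).chunk(2, dim=-1)
        z = mel * torch.exp(log_s) + b                        # data -> latent
        log_det = log_s.sum(dim=(1, 2))                       # likelihood contribution
        return z, log_det

    @torch.no_grad()
    def infer(self, z, text_ctx):
        # Inverse pass: decode mel frames one step at a time from latents z.
        B, T, n_mel = z.shape
        mel = torch.zeros(B, T, n_mel, device=z.device)
        prev, state = torch.zeros(B, 1, n_mel, device=z.device), None
        for t in range(T):
            h, state = self.ar(torch.cat([prev, text_ctx[:, t:t + 1]], dim=-1), state)
            log_s, b = self.proj(h).chunk(2, dim=-1)
            prev = (z[:, t:t + 1] - b) * torch.exp(-log_s)
            mel[:, t:t + 1] = prev
        return mel
```

In this framing, z is drawn from a Gaussian at inference time; shrinking its variance yields flatter, more averaged speech, while moving around the latent space gives the kind of variation and style control described in the abstract.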
Related papers
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
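
Finite-scalar quantization itself is only a few lines: each latent dimension is squashed into a bounded range, rounded to a small fixed set of levels, and gradients pass through a straight-through estimator, so every code in the implicit product codebook is reachable. The sketch below shows that generic operation; the level counts are placeholders, not CosyVoice 2's actual configuration.

```python
import torch

def fsq(z, levels=(8, 8, 8, 8)):
    """Finite-scalar quantization (generic sketch, not CosyVoice 2's exact setup).

    Each latent dimension (z's last axis, assumed to match len(levels)) is squashed
    into a bounded range and rounded to one of `levels[d]` values; a straight-through
    estimator keeps gradients flowing. The implicit codebook size is the product of
    the per-dimension levels, so every code is usable by construction, which is what
    improves codebook utilization relative to a learned VQ codebook.
    """
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    bounded = torch.tanh(z) * half                 # values in [-half_d, half_d]
    quantized = torch.round(bounded)               # snap to the nearest level
    # Straight-through: forward uses the rounded values, backward sees the identity.
    return bounded + (quantized - bounded).detach()
```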
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
- V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow [57.51550409392103]
We introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos.
To address the challenges of this task, we decompose the speech signal into manageable subspaces, each representing distinct speech attributes, and predict them directly from the visual input.
To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture.
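
Flow matching in its rectified-flow form trains the decoder to regress a constant velocity along a straight path between noise and the target features. A minimal training-loss sketch follows, with a placeholder `model` signature and shapes; it illustrates the generic objective rather than V2SFlow's exact recipe.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1, cond):
    """Generic rectified-flow / flow-matching loss (illustrative, not V2SFlow's code).

    x1:   (B, T, D) target speech features
    cond: conditioning (e.g. predicted speech attributes) passed through to the model
    A straight path x_t = (1 - t) * x0 + t * x1 connects Gaussian noise x0 to the
    data x1; the network is trained to predict the constant velocity x1 - x0.
    """
    x0 = torch.randn_like(x1)                                    # noise endpoint
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                                   # point on the path
    v_target = x1 - x0                                           # constant velocity
    v_pred = model(xt, t.flatten(), cond)                        # placeholder signature
    return F.mse_loss(v_pred, v_target)
```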
arXiv Detail & Related papers (2024-11-29T05:55:20Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
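
The vector-quantized component of such a system reduces to a nearest-neighbour codebook lookup over continuous speech features, with a straight-through gradient for training. The sketch below shows that generic operation; shapes and names are illustrative, not the paper's implementation.

```python
import torch

def vq_lookup(features, codebook):
    """Nearest-neighbour vector quantization (generic sketch, not the paper's code).

    features: (B, T, D) continuous speech representations
    codebook: (K, D) learned code vectors
    Returns discrete code indices and quantized vectors with a straight-through
    gradient, as in standard VQ-VAE-style training.
    """
    B = features.size(0)
    dists = torch.cdist(features, codebook.unsqueeze(0).expand(B, -1, -1))  # (B, T, K)
    indices = dists.argmin(dim=-1)                          # (B, T) discrete codes
    quantized = codebook[indices]                           # (B, T, D) looked-up vectors
    quantized = features + (quantized - features).detach()  # straight-through gradient
    return indices, quantized
```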
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis [3.6159128762538018]
We show that normalizing flow based prior networks can result in more expressive speech at the cost of a slight drop in quality.
We also propose a Dynamical VAE model that can generate higher-quality speech, although with decreased expressiveness and variability compared to the flow-based models.
arXiv Detail & Related papers (2022-11-02T17:45:01Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model with a simplified training pipeline that outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
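
Schematically, joint training sums the acoustic-model losses (duration, pitch, energy, alignment) with the vocoder's GAN losses (adversarial, feature matching, mel reconstruction) and backpropagates through the whole pipeline. The sketch below is only a schematic combination with placeholder weights and loss names, not the actual JETS recipe.

```python
def jets_style_generator_loss(losses, w_adv=1.0, w_fm=2.0, w_mel=45.0):
    """Schematic joint objective for an E2E acoustic model + GAN vocoder.

    `losses` is a dict of already-computed scalar losses; the names and weights
    here are placeholders for illustration and do not reproduce the exact JETS recipe.
    """
    return (losses["duration"] + losses["pitch"] + losses["energy"]  # variance-predictor losses
            + losses["align"]                                        # alignment-learning loss
            + w_adv * losses["adv"]                                  # adversarial (vocoder) loss
            + w_fm * losses["feat_match"]                            # feature-matching loss
            + w_mel * losses["mel"])                                 # mel reconstruction loss
```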
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- GANtron: Emotional Speech Synthesis with Generative Adversarial Networks [0.0]
We propose a text-to-speech model where the inferred speech can be tuned with the desired emotions.
We use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism.
arXiv Detail & Related papers (2021-10-06T10:44:30Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
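
In generic form, relative position information enters self-attention as a learned bias indexed by the clipped query-key offset rather than by absolute positions, which lets the model generalize across the variable lengths seen in speech. The sketch below shows that generic relative-bias variant; it is illustrative and not necessarily the exact scheme adapted in the paper.

```python
import torch
import torch.nn as nn

class RelativeBiasAttention(nn.Module):
    """Self-attention with a learned relative-position bias (generic sketch)."""

    def __init__(self, dim=256, max_dist=128):  # placeholder sizes
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.max_dist = max_dist
        # One learned bias per clipped relative offset in [-max_dist, max_dist].
        self.rel_bias = nn.Embedding(2 * max_dist + 1, 1)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / D ** 0.5            # (B, T, T) content term
        pos = torch.arange(T, device=x.device)
        offsets = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        scores = scores + self.rel_bias(offsets + self.max_dist).squeeze(-1)
        return torch.softmax(scores, dim=-1) @ v
```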
arXiv Detail & Related papers (2020-05-20T09:53:06Z)