Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech
- URL: http://arxiv.org/abs/2205.04120v1
- Date: Mon, 9 May 2022 08:39:53 GMT
- Title: Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech
- Authors: Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu,
Ying Wen, Yang Yang, Jun Wang
- Abstract summary: Cross-utterance conditional VAE is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme.
CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information.
Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
- Score: 27.84124625934247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modelling prosody variation is critical for synthesizing natural and
expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a
cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior
probability distribution of the latent prosody features for each phoneme by
conditioning on acoustic features, speaker information, and text features
obtained from both past and future sentences. At inference time, instead of the
standard Gaussian distribution used by VAE, CUC-VAE allows sampling from an
utterance-specific prior distribution conditioned on cross-utterance
information, which ties the generated prosody features to the surrounding
context and more closely resembles how humans naturally produce prosody. The
performance of CUC-VAE is evaluated via qualitative listening tests for
naturalness and intelligibility, and via quantitative measurements including
word error rates and the standard deviation of prosody attributes. Experimental
results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS
system improves naturalness and prosody diversity with clear margins.
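The inference-time change described in the abstract — sampling prosody latents from an utterance-specific prior conditioned on cross-utterance information, rather than from the standard Gaussian — can be sketched roughly as below. The linear projections, dimensions, and variable names are illustrative assumptions for this sketch, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_prior(cross_utt_embedding, W_mu, W_logvar):
    """Map a cross-utterance context embedding to the parameters (mu, log-variance)
    of an utterance-specific Gaussian prior (illustrative linear projection)."""
    mu = W_mu @ cross_utt_embedding
    logvar = W_logvar @ cross_utt_embedding
    return mu, logvar

def sample_prosody_latent(mu, logvar, rng):
    """Reparameterised sample z = mu + sigma * eps from the conditional prior."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical dimensions: 8-dim cross-utterance embedding, 4-dim latent per phoneme.
ctx = rng.standard_normal(8)
W_mu = rng.standard_normal((4, 8)) * 0.1
W_logvar = rng.standard_normal((4, 8)) * 0.1

mu, logvar = conditional_prior(ctx, W_mu, W_logvar)
z = sample_prosody_latent(mu, logvar, rng)
```

A plain VAE would instead draw `z` from N(0, I); here the mean and variance depend on the context embedding, so different surrounding sentences yield different prosody distributions.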
Related papers
- Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization [34.51491788470738]
We propose reverse inference optimization (RIO) to enhance the robustness of autoregressive-model-based text-to-speech (TTS) systems.
RIO uses reverse inference as the standard to select exemplars used in RLHF from the speech samples generated by the TTS system itself.
RIO significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions.
arXiv Detail & Related papers (2024-07-02T13:04:04Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experiment evaluations on popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence and is fine-tuned on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
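The latent prosody vectors above rest on vector quantization: continuous prosody features are snapped to the nearest entry of a learned codebook. A generic nearest-neighbour VQ lookup (not ProsoSpeech's actual encoder; the shapes and names here are assumptions for illustration) can be sketched as:

```python
import numpy as np

def quantize_prosody(frame_features, codebook):
    """Assign each frame-level prosody feature to its nearest codebook entry,
    returning discrete code indices and the quantized vectors."""
    # Squared Euclidean distances, shape (n_frames, n_codes).
    d = ((frame_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # nearest code per frame
    return idx, codebook[idx]       # discrete indices and quantized features

rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 3))   # 16 hypothetical prosody codes, 3-dim
frames = rng.standard_normal((5, 3))      # 5 frames of prosody features
idx, lpv = quantize_prosody(frames, codebook)
```

The discrete indices are what a downstream predictor (like ProsoSpeech's LPV predictor) would learn to produce from text alone.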
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- A learned conditional prior for the VAE acoustic space of a TTS system [17.26941119364184]
Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling.
We propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system.
arXiv Detail & Related papers (2021-06-14T15:36:16Z)
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech [4.348588963853261]
We introduce Grad-TTS, a novel text-to-speech model with a score-based decoder producing mel-spectrograms.
The framework of flexible differential equations helps us to generalize conventional diffusion probabilistic models.
Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.
arXiv Detail & Related papers (2021-05-13T14:47:44Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.