Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior
- URL: http://arxiv.org/abs/2002.03788v1
- Date: Thu, 6 Feb 2020 12:35:50 GMT
- Title: Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior
- Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew
Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
- Abstract summary: This paper proposes a sequential prior in a discrete latent space that can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
- Score: 53.69310441063162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features
enable precise control of the prosody of synthesized speech. Such models
typically incorporate a fine-grained variational autoencoder (VAE) structure,
extracting latent features at each input token (e.g., phonemes). However,
generating samples with the standard VAE prior often results in unnatural and
discontinuous speech, with dramatic prosodic variation between tokens. This
paper proposes a sequential prior in a discrete latent space that can generate
more natural-sounding samples. This is accomplished by discretizing the
latent features using vector quantization (VQ), and separately training an
autoregressive (AR) prior model over the result. We evaluate the approach using
listening tests, objective metrics of automatic speech recognition (ASR)
performance, and measurements of prosody attributes. Experimental results show
that the proposed model significantly improves the naturalness in random sample
generation. Furthermore, initial experiments demonstrate that randomly sampling
from the proposed model can be used as data augmentation to improve the ASR
performance.
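The recipe described in the abstract, quantizing the per-token latent features with a VQ codebook and then fitting a separate autoregressive prior over the resulting code sequence, can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed shapes and hyperparameters (codebook size, latent dimension, LSTM prior), not the authors' implementation.

```python
# Minimal sketch of VQ latent quantization plus a separate AR prior over the
# discrete codes; shapes and hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour VQ layer with a straight-through gradient estimator."""
    def __init__(self, num_codes=256, code_dim=3, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):                          # z_e: (batch, tokens, code_dim)
        flat = z_e.reshape(-1, z_e.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = dists.argmin(dim=-1)                         # nearest code per token
        z_q = self.codebook(idx).view_as(z_e)
        # Codebook loss plus commitment loss, then straight-through estimator.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.reshape(z_e.shape[:-1]), loss

class ARPrior(nn.Module):
    """Autoregressive prior over the code sequence (one discrete code per token)."""
    def __init__(self, num_codes=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_codes, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_codes)

    def forward(self, codes):                        # codes: (batch, tokens)
        # Predict each code from the preceding ones (first code is not modeled here).
        h, _ = self.lstm(self.embed(codes[:, :-1]))
        logits = self.out(h)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               codes[:, 1:].reshape(-1))
```

At synthesis time one would sample code indices step by step from the AR prior and feed the corresponding codebook vectors to the decoder in place of posterior samples; per the abstract, this produces more natural prosody than drawing each token's latent independently from a standard VAE prior.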
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator.
In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text.
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
arXiv Detail & Related papers (2022-10-28T07:52:30Z)
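The periodicity-generator idea summarized in the entry above, turning frame-level pitch and voicing predictions into a sample-level sinusoidal excitation for the waveform decoder, can be illustrated with a short sketch. This is not the Period VITS implementation; the hop size, sample rate, amplitudes, and simple repetition upsampling are assumptions made for the example.

```python
# Hypothetical illustration of a sample-level sinusoidal source built from
# frame-level pitch (F0) and voicing predictions; not the Period VITS code.
import numpy as np

def sinusoidal_source(f0_frames, voiced_frames, hop=256, sr=24000):
    """f0_frames: per-frame F0 in Hz; voiced_frames: per-frame 0/1 voicing flags."""
    # Upsample frame-level features to sample level (simple repetition).
    f0 = np.repeat(f0_frames, hop)
    voiced = np.repeat(voiced_frames, hop)
    # Integrate instantaneous frequency to obtain a running phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    # Sinusoid in voiced regions, low-level noise in unvoiced regions.
    source = np.where(voiced > 0.5,
                      0.1 * np.sin(phase),
                      0.003 * np.random.randn(len(phase)))
    return source.astype(np.float32)

# Example: 60 voiced frames at 220 Hz followed by 40 unvoiced frames.
src = sinusoidal_source(np.concatenate([np.full(60, 220.0), np.zeros(40)]),
                        np.concatenate([np.ones(60), np.zeros(40)]))
```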
- Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech [27.84124625934247]
Cross-utterance conditional VAE is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme.
CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information.
Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
arXiv Detail & Related papers (2022-05-09T08:39:53Z)
- Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain [18.090665052145653]
We propose a novel training task for speech enhancement using a complex-valued deep neural network.
We derive this training task within the formalism of differential equations, thereby enabling the use of predictor-corrector samplers.
arXiv Detail & Related papers (2022-03-31T12:53:47Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
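The self-normalization property noted in the entry above, where the importance weights are divided by their own sum so that no separate correction step is needed, can be shown with a tiny generic estimator. This is an illustrative sketch of self-normalized importance sampling in general, not the paper's language-model training criterion; the target and proposal densities are arbitrary examples.

```python
# Generic self-normalized importance sampling estimator (illustrative only).
import numpy as np

def snis_expectation(f, log_p_unnorm, log_q, q_sampler, n=10000, rng=None):
    """Estimate E_p[f(x)] when p is known only up to a constant, using samples from q."""
    rng = np.random.default_rng(rng)
    x = q_sampler(rng, n)                      # draw n samples from the proposal q
    log_w = log_p_unnorm(x) - log_q(x)         # unnormalized log importance weights
    w = np.exp(log_w - log_w.max())            # stabilize before exponentiating
    w /= w.sum()                               # self-normalization: weights sum to 1
    return np.sum(w * f(x))

# Example: mean of an unnormalized N(1, 1) target estimated with a N(0, 2) proposal.
est = snis_expectation(
    f=lambda x: x,
    log_p_unnorm=lambda x: -0.5 * (x - 1.0) ** 2,
    log_q=lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0),
    q_sampler=lambda rng, n: rng.normal(0.0, 2.0, size=n),
)
```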
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement method (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- A learned conditional prior for the VAE acoustic space of a TTS system [17.26941119364184]
Generative models, such as variational autoencoders (VAEs), capture variability in speech and allow multiple renditions of the same sentence via sampling.
We propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system.
arXiv Detail & Related papers (2021-06-14T15:36:16Z)
- Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis [19.386519810463003]
This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech.
Our proposed framework also provides the controllability of speaking style in an entire utterance.
arXiv Detail & Related papers (2020-09-17T18:00:19Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)