A learned conditional prior for the VAE acoustic space of a TTS system
- URL: http://arxiv.org/abs/2106.10229v1
- Date: Mon, 14 Jun 2021 15:36:16 GMT
- Title: A learned conditional prior for the VAE acoustic space of a TTS system
- Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman
- Abstract summary: Generative models, such as variational autoencoders (VAEs), capture the variability of speech and allow multiple renditions of the same sentence via sampling.
We propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system.
- Score: 17.26941119364184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many factors influence speech yielding different renditions of a given
sentence. Generative models, such as variational autoencoders (VAEs), capture
this variability and allow multiple renditions of the same sentence via
sampling. The degree of prosodic variability depends heavily on the prior that
is used when sampling. In this paper, we propose a novel method to compute an
informative prior for the VAE latent space of a neural text-to-speech (TTS)
system. By doing so, we aim to sample with more prosodic variability, while
gaining controllability over the latent space's structure.
By using as the prior the posterior distribution of a secondary VAE, which we
condition on a speaker vector, we can sample from the primary VAE while
explicitly taking the conditioning into account, yielding samples from a
specific region of the latent space for each condition (i.e. speaker). A
formal preference test demonstrates a significant preference for the proposed
approach over a standard Conditional VAE. We also provide visualisations of
the latent
space where well-separated condition-specific clusters appear, as well as
ablation studies to better understand the behaviour of the system.
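To make the mechanism concrete, below is a minimal sketch of the training objective this implies: the standard N(0, I) prior in the KL term is replaced by the posterior of a secondary, speaker-conditioned encoder. All module names, sizes, and variables are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a primary VAE whose KL term is taken
# against the posterior of a secondary, speaker-conditioned encoder instead of
# the standard N(0, I) prior.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.Tanh())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

acoustic_dim, speaker_dim, z_dim = 80, 64, 16
primary_encoder = GaussianEncoder(acoustic_dim, z_dim)  # q(z | acoustics)
prior_encoder = GaussianEncoder(speaker_dim, z_dim)     # learned prior p(z | speaker)

x = torch.randn(8, acoustic_dim)    # batch of acoustic features
spk = torch.randn(8, speaker_dim)   # speaker vectors (the conditioning)

mu_q, logvar_q = primary_encoder(x)
mu_p, logvar_p = prior_encoder(spk)

# The KL term pulls the primary posterior toward the speaker-specific prior
# region, which is what makes condition-specific clusters form in the space.
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p).mean()
```

At inference, sampling z from prior_encoder(spk) rather than from N(0, I) then draws renditions from the region of the latent space associated with that speaker.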
Related papers
- Conditional Sampling of Variational Autoencoders via Iterated
Approximate Ancestral Sampling [7.357511266926065]
Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable.
A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG).
arXiv Detail & Related papers (2023-08-17T16:08:18Z)
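For intuition about MWG in this setting, here is a toy, self-contained imputation loop for a linear-Gaussian "VAE" in which a hand-built encoder and decoder stand in for trained networks; it illustrates the algorithm named above, not that paper's code.

```python
# Toy Metropolis-within-Gibbs imputation: prior p(z) = N(0, I), decoder
# p(x|z) = N(Wz, sigma2*I), and a crude fixed linear "encoder" as the
# amortized proposal q(z|x). Illustrative stand-ins only.
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, var):
    """Log-density of a diagonal Gaussian (up to broadcasting)."""
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

W = rng.normal(size=(4, 2))
sigma2 = 0.1

def enc(x):
    """Stand-in amortized proposal q(z | x): mean and variance."""
    return W.T @ x / 4.0, np.full(2, 0.5)

def mwg_impute(x_obs, obs_mask, n_steps=200):
    x = np.where(obs_mask, x_obs, 0.0)          # initialize missing entries
    mu, var = enc(x)
    z = mu + np.sqrt(var) * rng.normal(size=2)
    for _ in range(n_steps):
        # (1) Metropolis-Hastings step in z, with q(z|x) as the proposal.
        mu, var = enc(x)
        z_prop = mu + np.sqrt(var) * rng.normal(size=2)
        log_alpha = (log_gauss(z_prop, 0.0, 1.0) + log_gauss(x, W @ z_prop, sigma2)
                     - log_gauss(z, 0.0, 1.0) - log_gauss(x, W @ z, sigma2)
                     + log_gauss(z, mu, var) - log_gauss(z_prop, mu, var))
        if np.log(rng.uniform()) < log_alpha:
            z = z_prop
        # (2) Gibbs step: resample the missing entries from the decoder.
        x_dec = W @ z + np.sqrt(sigma2) * rng.normal(size=4)
        x = np.where(obs_mask, x_obs, x_dec)
    return x

x_imputed = mwg_impute(rng.normal(size=4), np.array([True, True, False, False]))
```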
- Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs).
Standard conformal prediction produces prediction sets with rigorous statistical guarantees.
We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
arXiv Detail & Related papers (2023-06-16T21:55:08Z)
- Structured Voronoi Sampling [61.629198273926676]
In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods.
We name our gradient-based technique Structured Voronoi Sampling (SVS).
In a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.
arXiv Detail & Related papers (2023-06-05T17:32:35Z)
- Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech [27.84124625934247]
A cross-utterance conditioned VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme.
CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information.
Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity by clear margins.
arXiv Detail & Related papers (2022-05-09T08:39:53Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
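As a reference point for the quantization step such systems rely on, a minimal straight-through vector-quantization layer might look like the following sketch (illustrative only, not that paper's system):

```python
# Minimal vector-quantization step: snap each encoding to its nearest codebook
# entry, with a straight-through gradient. Names and sizes are illustrative.
import torch

def vector_quantize(h, codebook):
    """Replace each row of h with its nearest codebook vector."""
    d = torch.cdist(h, codebook)          # pairwise distances (batch, vocab)
    idx = d.argmin(dim=-1)                # index of nearest entry per row
    q = codebook[idx]                     # quantized latents
    # Straight-through estimator: forward uses q, backward passes grads to h.
    return h + (q - h).detach(), idx

codebook = torch.randn(64, 16)               # 64 codebook entries of dim 16
h = torch.randn(8, 16, requires_grad=True)   # encoder outputs
q, idx = vector_quantize(h, codebook)
```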
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
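The generative side of such a switching model is easy to picture: a Markov chain over a discrete switch state selects which decoder generates each frame. The transition matrix and stand-in "decoders" below are invented for illustration; the paper's actual contribution, the variational EM inference, is not shown.

```python
# Generative sketch of a switching model: a Markov chain over a discrete
# switch state picks which (stand-in) VAE decoder produces each frame.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.95, 0.05],      # Markovian switch transition probabilities
              [0.10, 0.90]])
decoders = [lambda z: 1.0 * z,   # stand-in decoder means, one per "architecture"
            lambda z: -0.5 * z]

frames, s = [], 0
for _ in range(50):
    s = rng.choice(2, p=A[s])    # sample the next switch state
    z = rng.normal(size=4)       # per-frame latent variable
    frames.append(decoders[s](z))
x = np.stack(frames)             # generated sequence of shape (50, 4)
```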
- A Contrastive Learning Approach for Training Variational Autoencoder Priors [137.62674958536712]
Variational autoencoders (VAEs) are among the most powerful likelihood-based generative models, with applications in many domains.
One explanation for VAEs' poor generative quality is the prior hole problem: the prior distribution fails to match the aggregate approximate posterior.
We propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base closer to the aggregate posterior.
arXiv Detail & Related papers (2020-10-06T17:59:02Z)
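One simple way to draw from such a reweighted prior, p(z) ∝ p_base(z) · r(z), is sampling-importance-resampling; the r below is a hypothetical stand-in for the learned reweighting factor, not the paper's trained network.

```python
# Sampling-importance-resampling from p(z) proportional to p_base(z) * r(z):
# draw proposals from the base prior, then resample in proportion to r.
import numpy as np

rng = np.random.default_rng(0)

def r(z):
    # Hypothetical reweighting factor; the paper learns this contrastively
    # so that the product matches the aggregate posterior.
    return np.exp(-0.5 * np.sum((z - 1.0) ** 2, axis=-1))

base = rng.normal(size=(10000, 2))          # proposals from base prior N(0, I)
w = r(base)
w = w / w.sum()                             # normalized importance weights
idx = rng.choice(len(base), size=100, p=w)  # resample proportionally to r(z)
z_samples = base[idx]                       # approximate draws from the product
```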
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
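To illustrate what sampling from a sequential prior over a discrete latent space involves, here is a tiny autoregressive sampler over quantized codes; the GRU prior and codebook are untrained placeholders, not that paper's model.

```python
# Tiny autoregressive prior over quantized latent codes: sample one discrete
# code at a time, conditioned on the previous code via a GRU. Untrained toy.
import torch
import torch.nn as nn

vocab, dim, steps = 32, 8, 20
codebook = nn.Embedding(vocab, dim)   # quantized latent codebook
gru = nn.GRU(dim, 64, batch_first=True)
head = nn.Linear(64, vocab)           # logits over the next code

codes, h = [torch.zeros(1, dtype=torch.long)], None
with torch.no_grad():
    for _ in range(steps):
        emb = codebook(codes[-1]).unsqueeze(1)           # (1, 1, dim)
        out, h = gru(emb, h)
        probs = torch.softmax(head(out[:, -1]), dim=-1)  # next-code distribution
        codes.append(torch.multinomial(probs, 1).squeeze(1))
    z = codebook(torch.stack(codes[1:], dim=1))          # (1, steps, dim)
```

Decoding z through a TTS decoder would yield one prosody rendition; resampling the code sequence yields another.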
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.