Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech
- URL: http://arxiv.org/abs/2110.12539v3
- Date: Thu, 14 Sep 2023 12:34:51 GMT
- Title: Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech
- Authors: Marek Strong, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, Trevor Wood
- Abstract summary: We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture that uses a split vector quantizer for neural text-to-speech (NTTS).
We demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE)
architecture using a split vector quantizer for NTTS, as an enhancement to the
well-known Variational Autoencoder (VAE) and Vector Quantized Variational
Autoencoder (VQ-VAE) architectures. Compared to these previous architectures,
our proposed model retains the benefits of using an utterance-level bottleneck,
while keeping significant representation power and a discretized latent space
small enough for efficient prediction from text. We train the model on
recordings in the expressive task-oriented dialogue domain and show that
SVQ-VAE achieves a statistically significant improvement in naturalness over
the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent
acoustic space is predictable from text, reducing the gap between the standard
constant vector synthesis and vocoded recordings by 32%.
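The abstract describes the split-quantization mechanism only at a high level. As a rough illustration of the idea, the latent is split into sub-vectors and each sub-vector is snapped to the nearest entry of its own small codebook; the sketch below is an assumed toy version in NumPy (function name, dimensions, and random codebooks are illustrative, not the authors' implementation):

```python
import numpy as np

def split_vector_quantize(z, codebooks):
    """Quantize latent z by splitting it into equal chunks and snapping
    each chunk to the nearest entry of its own codebook.

    z:         (d,) latent vector
    codebooks: list of S arrays, each of shape (K, d // S)
    Returns the quantized vector and the S selected code indices.
    """
    chunks = np.split(z, len(codebooks))
    indices, quantized = [], []
    for chunk, book in zip(chunks, codebooks):
        dists = np.linalg.norm(book - chunk, axis=1)  # distance to every code
        k = int(np.argmin(dists))
        indices.append(k)
        quantized.append(book[k])
    return np.concatenate(quantized), indices

# Toy example: a 16-dim latent, 4 splits, 8 codes per split.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(4)]
z = rng.normal(size=16)
z_q, idx = split_vector_quantize(z, codebooks)
```

With S splits of K codes each, the effective discrete space has K**S points (here 8**4 = 4096), while prediction from text reduces to S small K-way classifications, which is consistent with the abstract's claim of a discretized latent space small enough for efficient prediction from text.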
Related papers
- VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space
VQalAttent is a lightweight model designed to generate fake speech with tunable performance and interpretability.
Our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources.
arXiv Detail & Related papers (2024-11-22T00:21:39Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- Speech Modeling with a Hierarchical Transformer Dynamical VAE
We propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE).
We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure.
arXiv Detail & Related papers (2023-03-07T13:35:45Z)
- Vector Quantized Wasserstein Auto-Encoder
We study learning deep discrete representations from the generative viewpoint.
We place discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over codeword sequences to the data distribution.
We further develop theory connecting this approach with the clustering viewpoint of the Wasserstein (WS) distance, yielding a better and more controllable clustering solution.
arXiv Detail & Related papers (2023-02-12T13:51:36Z)
- Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders
Dynamical variational auto-encoders (DVAEs) are a class of deep generative models with latent variables.
We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs.
We derive a variational expectation-maximization algorithm to perform speech enhancement.
arXiv Detail & Related papers (2021-06-23T09:48:38Z)
- Learning Robust Latent Representations for Controllable Speech Synthesis
We propose RTI-VAE (Reordered Transformer with Information reduction VAE) to minimize the mutual information between different latent variables.
We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.
arXiv Detail & Related papers (2021-05-10T15:49:03Z)
- Self-Supervised Variational Auto-Encoders
We present a novel class of generative models called the self-supervised Variational Auto-Encoder (selfVAE).
This class of models supports both conditional and unconditional sampling while simplifying the objective function.
We report the performance of our approach on three benchmark image datasets (CIFAR-10, Imagenette64, and CelebA).
arXiv Detail & Related papers (2020-10-05T13:42:28Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs with a strong auto-regressive decoder tend to ignore latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
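Several entries in the list above (VQScore, WavTokenizer, DiscreTalk) rely on the same primitive: nearest-codebook quantization. As a hedged sketch of the VQScore-style idea of using the quantization error of a codebook fit to clean speech as a reference-free quality proxy, here is a toy version with a random codebook and synthetic frames standing in for a trained VQ-VAE (all names and dimensions are illustrative assumptions):

```python
import numpy as np

def quantization_error(features, codebook):
    """Mean squared distance from each frame to its nearest code.

    Lower error means the frames look more like the (clean) data the
    codebook was trained on; distortion pushes frames away from all codes.
    """
    # (T, 1, d) - (1, K, d) -> (T, K, d); reduce to per-frame nearest-code distance
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).mean())

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))  # stands in for a codebook trained on clean speech
# "Clean" frames sit near codes; "noisy" frames drift away, raising the error.
clean = codebook[rng.integers(0, 32, 100)] + 0.05 * rng.normal(size=(100, 8))
noisy = clean + rng.normal(size=(100, 8))
assert quantization_error(clean, codebook) < quantization_error(noisy, codebook)
```

The same nearest-code search also underlies the self-supervised enhancement training mentioned in the VQScore entry: the quantization error doubles as a training signal that pulls distorted features back toward the clean-speech codebook.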
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.