Related papers: Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

URL: http://arxiv.org/abs/2106.13479v1
Date: Fri, 25 Jun 2021 07:51:35 GMT
Title: Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance
Authors: Hieu-Thi Luong and Junichi Yamagishi
Abstract summary: We investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart. By enforcing different policies over the latent spaces during training, we obtain latent linguistic embeddings that take on different properties while keeping similar quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
Score: 55.10864476206503
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representation, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding that takes on different properties while having a similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations, but has a discrete latent space that is useful for reducing the representation bit-rate, which is desirable for data transferring, or limiting the information leaking, which is important for speaker anonymization and other tasks of that nature.
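
The bit-rate argument in the abstract can be illustrated with a minimal sketch of nearest-neighbour vector quantization. This is not the authors' implementation: the function names `quantize` and `bits_per_second`, the toy codebook, the frame vectors, and the frame rate of 100 frames per second are all hypothetical values chosen for illustration. The point is only that once each frame is replaced by a codebook index, a frame costs about log2(K) bits for a codebook of size K instead of a full floating-point vector.

```python
import math


def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest codebook entry."""
    indices = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook vector.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def bits_per_second(codebook_size, frames_per_second):
    """Bit-rate needed to send one codebook index per frame."""
    return math.ceil(math.log2(codebook_size)) * frames_per_second


# Hypothetical toy values, not taken from the paper.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices = quantize(frames, codebook)  # one integer per frame
rate = bits_per_second(len(codebook), frames_per_second=100)
print(indices, rate)
```

With four codebook entries each index needs 2 bits, so under the assumed frame rate the discrete representation costs 200 bits per second regardless of the embedding dimension.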

Related papers

Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling [39.80957479349776]
We investigate the prosody modeling capabilities of the discrete space of an RVQ-VAE model, modifying it to operate on the phoneme-level. We show that the phoneme-level discrete latent representations achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable.
arXiv Detail & Related papers (2024-09-13T09:27:05Z)
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data. The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis [3.691712391306624]
We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases to capture diverse prosodic representations. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step.
arXiv Detail & Related papers (2022-11-01T15:17:25Z)
Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment. The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types. FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network which facilitates the learning of strong spectral frame-level representations. We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
arXiv Detail & Related papers (2020-09-23T21:51:29Z)
Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing. Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement. We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning. VAEs tend to ignore latent variables with a strong auto-regressive decoder. We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features. We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.