Conditional Deep Hierarchical Variational Autoencoder for Voice
Conversion
- URL: http://arxiv.org/abs/2112.02796v1
- Date: Mon, 6 Dec 2021 05:54:11 GMT
- Title: Conditional Deep Hierarchical Variational Autoencoder for Voice
Conversion
- Authors: Kei Akuzawa, Kotaro Onishi, Keisuke Takiguchi, Kohki Mametani,
Koichiro Mori
- Abstract summary: Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speech samples and speaker labels for training.
This paper investigates how increasing model expressiveness benefits and affects VAE-VC.
- Score: 5.538544897623972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Variational autoencoder-based voice conversion (VAE-VC) has the
advantage of requiring only pairs of speech samples and speaker labels for
training. Unlike the majority of research in VAE-VC, which focuses on
utilizing auxiliary losses or discretizing latent variables, this paper
investigates how increasing model expressiveness benefits and affects
VAE-VC. Specifically, we first analyze VAE-VC from a rate-distortion
perspective and point out that model expressiveness is significant for
VAE-VC because rate and distortion reflect the similarity and naturalness
of converted speech. Based on the analysis, we propose a novel VC method
using a deep hierarchical VAE, which has high model expressiveness as well
as fast conversion speed thanks to its non-autoregressive decoder. Our
analysis also reveals another problem: similarity can be degraded when the
latent variable of a VAE carries redundant information. We address this
problem by controlling the information contained in the latent variable
using the $\beta$-VAE objective. In experiments on the VCTK corpus, the
proposed method achieved mean opinion scores higher than 3.5 on both
naturalness and similarity in inter-gender settings, which are higher than
the scores of existing autoencoder-based VC methods.
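As a concrete illustration of the rate-distortion view above, here is a minimal sketch of a speaker-conditional VAE trained with the $\beta$-VAE objective, where the KL term plays the role of the rate (information kept in the latent) and the reconstruction term plays the role of the distortion. All module names and dimensions are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class ConditionalBetaVAE(nn.Module):
        def __init__(self, n_spk, x_dim=80, z_dim=16, s_dim=32, h_dim=256,
                     beta=4.0):
            super().__init__()
            self.beta = beta
            self.spk_emb = nn.Embedding(n_spk, s_dim)
            self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)
            self.decoder = nn.Sequential(nn.Linear(z_dim + s_dim, h_dim),
                                         nn.ReLU(), nn.Linear(h_dim, x_dim))

        def forward(self, x, spk_id):
            h = self.encoder(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            x_hat = self.decoder(torch.cat([z, self.spk_emb(spk_id)], dim=-1))
            distortion = ((x_hat - x) ** 2).sum(-1).mean()   # reconstruction term
            rate = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(-1).mean()
            return distortion + self.beta * rate             # beta-VAE objective

Conversion would amount to encoding a source frame and decoding with the target speaker's embedding; raising beta shrinks the rate, squeezing redundant speaker information out of the latent, which is the control the abstract describes.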
Related papers
- Disentangled Variational Autoencoder for Emotion Recognition in
Conversations [14.92924920489251]
We propose a VAD-disentangled Variational AutoEncoder (VAD-VAE) for Emotion Recognition in Conversations (ERC).
VAD-VAE disentangles the three affect representations Valence, Arousal, and Dominance (VAD) from the latent space.
Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets.
arXiv Detail & Related papers (2023-05-23T13:50:06Z)
- Robust Disentangled Variational Speech Representation Learning for
Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder (sketched below).
On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity; the method remains robust even with noisy source/target utterances.
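A hedged sketch of that decoding step: the target-speaker embedding is broadcast over time and concatenated with per-frame content embeddings before a sequential decoder (a GRU here; all names and sizes are assumptions, not the paper's configuration).

    import torch
    import torch.nn as nn

    class SequentialVAEDecoder(nn.Module):
        def __init__(self, c_dim=64, s_dim=64, h_dim=256, n_mels=80):
            super().__init__()
            self.rnn = nn.GRU(c_dim + s_dim, h_dim, batch_first=True)
            self.out = nn.Linear(h_dim, n_mels)

        def forward(self, content, spk):  # content: (B, T, c_dim), spk: (B, s_dim)
            spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
            h, _ = self.rnn(torch.cat([content, spk], dim=-1))
            return self.out(h)            # predicted mel frames, (B, T, n_mels)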
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- AAVAE: Augmentation-Augmented Variational Autoencoders [43.73699420145321]
We introduce augmentation-augmented variational autoencoders (AAVAE), a third approach to self-supervised learning based on autoencoding.
We empirically evaluate the proposed AAVAE on image classification, similar to how recent contrastive and non-contrastive learning algorithms have been evaluated.
arXiv Detail & Related papers (2021-07-26T17:04:30Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
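An illustrative sketch of the vector-quantization step for content encoding (the MI-based correlation term is omitted, and the codebook size is an assumption): each frame embedding is snapped to its nearest codebook vector, with a straight-through estimator passing gradients back to the encoder.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, n_codes=512, dim=64):
            super().__init__()
            self.codebook = nn.Embedding(n_codes, dim)

        def forward(self, z):                  # z: (B, T, dim) content features
            flat = z.reshape(-1, z.size(-1))
            dist = torch.cdist(flat, self.codebook.weight)  # distance to codes
            idx = dist.argmin(dim=-1)
            q = self.codebook(idx).view_as(z)               # quantized frames
            commit = ((z - q.detach()) ** 2).mean()         # commitment loss
            q = z + (q - z).detach()                        # straight-through
            return q, commit, idx.view(z.shape[:-1])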
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes the noise-corrupted mel spectrogram and its corresponding diffusion-step information as input to predict the added Gaussian noise (sketched below).
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
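A minimal sketch of that training step under the standard DDPM parameterization (an assumption here; `denoiser` is a placeholder for the denoising module): corrupt the mel spectrogram at a randomly sampled step and regress the added Gaussian noise.

    import torch

    def diffusion_loss(denoiser, mel, alpha_bar):  # mel: (B, T, 80)
        # alpha_bar: 1-D tensor of cumulative noise-schedule products
        t = torch.randint(0, alpha_bar.size(0), (mel.size(0),), device=mel.device)
        a = alpha_bar[t].view(-1, 1, 1)
        noise = torch.randn_like(mel)
        noisy_mel = a.sqrt() * mel + (1 - a).sqrt() * noise  # forward process
        pred = denoiser(noisy_mel, t)          # predict the added noise at step t
        return ((pred - noise) ** 2).mean()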
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual
Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
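One plausible factorization consistent with this description (an assumption, not the paper's exact notation) has a discrete switch $s_t$ with Markovian dynamics selecting which VAE decoder generates each frame:

$$p(x_{1:T}, z_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(z_t)\, p_{s_t}(x_t \mid z_t)$$

where $p_{s_t}(x_t \mid z_t)$ is the decoder of the VAE selected by $s_t$; variational EM then alternates between inferring $s_t$ and $z_t$ and re-estimating the model parameters.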
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Hierarchical Variational Autoencoder for Visual Counterfactuals [79.86967775454316]
Conditional Variational Autoencoders (VAEs) are gathering significant attention as an Explainable Artificial Intelligence (XAI) tool.
In this paper we show how relaxing the effect of the posterior leads to successful counterfactuals.
We introduce VAEX, a hierarchical VAE designed for this approach that can visually audit a classifier in applications.
arXiv Detail & Related papers (2021-02-01T14:07:11Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour (a trained VAE does not necessarily autoencode: samples it generates are not guaranteed to be encoded back to the latents that produced them) on the learned representations, and also the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
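A minimal sketch of one reading of this self-consistency idea (illustrative only, not the paper's exact loss): reconstructions are pushed to encode back to the latents that produced them.

    import torch

    def self_consistency_loss(encoder, decoder, x):
        # encoder returns the mean and log-variance of q(z|x); names are assumed
        mu, logvar = encoder(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        x_hat = decoder(z)
        mu_re, _ = encoder(x_hat)            # re-encode the reconstruction
        return ((mu_re - mu) ** 2).mean()    # the two encodings should agree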
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Simple and Effective VAE Training with Calibrated Decoders [123.08908889310258]
Variational autoencoders (VAEs) provide an effective and simple method for modeling complex distributions.
We study the impact of calibrated decoders, which learn the uncertainty of the decoding distribution.
We propose a simple but novel modification to the commonly used Gaussian decoder, which computes the prediction variance analytically.
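A short sketch of the analytic-variance idea, assuming a Gaussian decoder with a single shared variance: the variance that maximizes the likelihood is simply the reconstruction MSE, so it can be computed in closed form rather than tuned as a hyperparameter.

    import math
    import torch

    def calibrated_gaussian_nll(x_hat, x):
        mse = ((x_hat - x) ** 2).mean()
        sigma2 = mse.clamp(min=1e-6)   # closed-form optimal shared variance
        # with sigma2 = mse, the per-dim NLL reduces to 0.5 * log(mse) + const
        return 0.5 * (mse / sigma2 + sigma2.log() + math.log(2.0 * math.pi))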
arXiv Detail & Related papers (2020-06-23T17:57:47Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
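The F0 control described above can be illustrated with the standard log-F0 linear transform (a common recipe, assumed here rather than taken from the paper): z-score log-F0 with source-speaker statistics, then denormalize with target-speaker statistics so the converted contour matches the target's range.

    import numpy as np

    def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
        # f0: frame-wise F0 in Hz, 0 for unvoiced frames; stats are over log-F0
        out = np.zeros_like(f0)
        voiced = f0 > 0
        z = (np.log(f0[voiced]) - src_mean) / src_std  # normalize to source
        out[voiced] = np.exp(z * tgt_std + tgt_mean)   # denormalize to target
        return out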
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
- Unsupervised Representation Disentanglement using Cross Domain Features
and Adversarial Learning in Variational Autoencoder based Voice Conversion [28.085498706505774]
An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal.
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning.
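One common way to implement the adversarial part (a sketch assuming a gradient-reversal speaker classifier; the paper's exact adversarial scheme may differ): a classifier predicts the speaker from the content latent, and the reversed gradient pushes the encoder to strip speaker information from it.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.clone()

        @staticmethod
        def backward(ctx, grad):
            return -ctx.lamb * grad, None   # reverse gradients into the encoder

    def adversarial_speaker_loss(classifier, z_content, spk_id, lamb=1.0):
        logits = classifier(GradReverse.apply(z_content, lamb))
        return nn.functional.cross_entropy(logits, spk_id)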
arXiv Detail & Related papers (2020-01-22T02:06:06Z)