Unsupervised Representation Disentanglement using Cross Domain Features
and Adversarial Learning in Variational Autoencoder based Voice Conversion
- URL: http://arxiv.org/abs/2001.07849v3
- Date: Fri, 7 Feb 2020 10:16:28 GMT
- Title: Unsupervised Representation Disentanglement using Cross Domain Features
and Adversarial Learning in Variational Autoencoder based Voice Conversion
- Authors: Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-Huai Peng, Yu
Tsao, Hsin-Min Wang
- Abstract summary: An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal.
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning.
- Score: 28.085498706505774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An effective approach for voice conversion (VC) is to disentangle linguistic
content from other components in the speech signal. The effectiveness of
variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies
on this principle. In our prior work, we proposed a cross-domain VAE-VC
(CDVAE-VC) framework, which utilized acoustic features of different properties,
to improve the performance of VAE-VC. We believed that the success came from
more disentangled latent representations. In this paper, we extend the CDVAE-VC
framework by incorporating the concept of adversarial learning, in order to
further increase the degree of disentanglement, thereby improving the quality
and similarity of converted speech. More specifically, we first investigate the
effectiveness of incorporating the generative adversarial networks (GANs) with
CDVAE-VC. Then, we consider the concept of domain adversarial training and add
an explicit constraint to the latent representation, realized by a speaker
classifier, to explicitly eliminate the speaker information that resides in the
latent code. Experimental results confirm that the degree of disentanglement of
the learned latent representation can be enhanced by both GANs and the speaker
classifier. Meanwhile, subjective evaluation results in terms of quality and
similarity scores demonstrate the effectiveness of our proposed methods.
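To make the domain adversarial constraint concrete, below is a minimal PyTorch sketch of a speaker classifier attached to the VAE latent code through a gradient-reversal layer. The module names, dimensions, and training-step interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates and scales gradients backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class SpeakerAdversary(nn.Module):
    """Speaker classifier on the latent code z; the reversed gradient trains
    the encoder to *remove* speaker information from z."""

    def __init__(self, latent_dim: int, n_speakers: int, lamb: float = 1.0):
        super().__init__()
        self.lamb = lamb
        self.clf = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_speakers),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.clf(GradReverse.apply(z, self.lamb))


# Hypothetical training step: reconstruction/KL terms pull z toward good
# decoding, while the reversed speaker-classification gradient pushes
# speaker identity out of z.
# z, recon_loss, kl_loss = vae(x)                      # assumed VAE interface
# adv_loss = nn.functional.cross_entropy(adversary(z), speaker_ids)
# (recon_loss + kl_loss + adv_loss).backward()
```

The GAN component of the proposed method would add a separate discriminator on the decoder output; it is omitted here for brevity.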
Related papers
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive speech emotion recognition accuracies.
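As a rough illustration of how a fixed-length AWE can be derived from continuous frame-level representations, here is a hedged sketch; mean pooling over a word segment is an assumption for illustration, not necessarily the paper's exact recipe.

```python
import torch


def acoustic_word_embedding(frame_feats: torch.Tensor,
                            start: int, end: int) -> torch.Tensor:
    """Mean-pool frame-level features of shape (T, D) over the word segment
    [start, end) to obtain a fixed-length embedding of size D."""
    return frame_feats[start:end].mean(dim=0)


# Example: 120 frames of 768-dim self-supervised features; the word spans
# frames 30..55.
feats = torch.randn(120, 768)
awe = acoustic_word_embedding(feats, 30, 55)  # shape: (768,)
```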
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- Disentangled Variational Autoencoder for Emotion Recognition in Conversations [14.92924920489251]
We propose a VAD-disentangled Variational AutoEncoder (VAD-VAE) for Emotion Recognition in Conversations (ERC).
VAD-VAE disentangles the three affect representations, Valence-Arousal-Dominance (VAD), from the latent space.
Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets.
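One plausible realization of this disentanglement, sketched below, partitions the latent vector into three slices and supervises each with its own regressor; the slice sizes and head design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class VADHeads(nn.Module):
    """Regress valence, arousal, and dominance from disjoint slices of the
    latent code, encouraging each slice to specialize (illustrative split)."""

    def __init__(self, slice_dim: int = 32):
        super().__init__()
        self.slice_dim = slice_dim
        self.heads = nn.ModuleList(nn.Linear(slice_dim, 1) for _ in range(3))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        d = self.slice_dim
        slices = [z[:, i * d:(i + 1) * d] for i in range(3)]   # V, A, D parts
        return torch.cat([h(s) for h, s in zip(self.heads, slices)], dim=-1)


z = torch.randn(8, 96)   # batch of latent codes: three 32-dim slices
vad = VADHeads()(z)      # shape (8, 3): per-utterance V/A/D predictions
```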
arXiv Detail & Related papers (2023-05-23T13:50:06Z)
- Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion [35.23123094710891]
We propose a high-similarity any-to-one voice conversion method that takes self-supervised learning (SSL) representations as input.
Experimental results show that our proposed method achieves similarity comparable to, and naturalness higher than, the supervised method.
arXiv Detail & Related papers (2023-05-16T04:52:29Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
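The conversion step described here amounts to pairing an arbitrary speaker embedding with the source content embeddings at the decoder input; a minimal sketch under assumed shapes (the GRU decoder and concatenation scheme are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn


class VCDecoder(nn.Module):
    """Decode mel frames from per-frame content embeddings conditioned on a
    single utterance-level speaker embedding broadcast over time."""

    def __init__(self, content_dim: int = 64, speaker_dim: int = 128,
                 mel_dim: int = 80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, 256, batch_first=True)
        self.out = nn.Linear(256, mel_dim)

    def forward(self, content: torch.Tensor, speaker: torch.Tensor):
        # content: (B, T, content_dim); speaker: (B, speaker_dim)
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)


# Zero-shot conversion: source-utterance content plus an arbitrary (unseen)
# target-speaker embedding.
mel = VCDecoder()(torch.randn(1, 200, 64), torch.randn(1, 128))  # (1, 200, 80)
```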
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- Conditional Deep Hierarchical Variational Autoencoder for Voice Conversion [5.538544897623972]
Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only pairs of speech and speaker labels for training.
This paper investigates how increasing the model expressiveness benefits and impacts VAE-VC.
arXiv Detail & Related papers (2021-12-06T05:54:11Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
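Below is a hedged sketch of a VQ content bottleneck with a straight-through gradient, the kind of discrete content encoding named here; the MI minimization is indicated only in a comment, since its estimator is beyond a short sketch.

```python
import torch
import torch.nn as nn


class VQBottleneck(nn.Module):
    """Quantize content features to their nearest codebook entries; the
    straight-through trick copies gradients past the discrete lookup."""

    def __init__(self, n_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim); compute distances to every codebook vector
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = torch.cdist(z, book).argmin(dim=-1)   # (B, T) discrete content
        q = self.codebook(codes)                      # (B, T, dim) quantized
        commit = ((q.detach() - z) ** 2).mean()       # commitment loss
        q = z + (q - z).detach()                      # straight-through estimator
        return q, codes, commit


# During training one would additionally minimize an estimate of the mutual
# information between the quantized content stream and the other speech
# representations, per the paper's MI-based objective.
q, codes, commit = VQBottleneck()(torch.randn(2, 100, 64))
```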
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
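A hedged sketch of two ingredients named in the title, a momentum (EMA) key encoder and a contrastive InfoNCE objective; the prototypical extension and queue management are omitted, and all shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def momentum_update(query_enc: nn.Module, key_enc: nn.Module, m: float = 0.999):
    """Key encoder trails the query encoder as an exponential moving average."""
    for q, k in zip(query_enc.parameters(), key_enc.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)


def info_nce(q: torch.Tensor, k_pos: torch.Tensor, queue: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: pull the positive key close, push queued negatives away.
    q, k_pos: (B, D) embeddings; queue: (K, D) negative keys."""
    q, k_pos, queue = (F.normalize(t, dim=-1) for t in (q, k_pos, queue))
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)      # (B, 1)
    l_neg = q @ queue.t()                              # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # the positive key sits at index 0 for every query
    return F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))
```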
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z)
- Spectrum-Guided Adversarial Disparity Learning [52.293230153385124]
We propose a novel end-to-end knowledge directed adversarial learning framework.
It portrays the class-conditioned intraclass disparity using two competitive encoding distributions and learns the purified latent codes by denoising the learned disparity.
The experiments on four HAR benchmark datasets demonstrate the robustness and generalization of our proposed methods over a set of state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-14T05:46:27Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
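A sketch of the self-adaptation idea: pool a speaker vector from the noisy test utterance itself and feed it back as an auxiliary input to an attention-based masking network. Dimensions and the pooling scheme are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class SelfAdaptiveEnhancer(nn.Module):
    """Mask-based enhancer conditioned on a speaker vector pooled from the
    same test utterance (no separate enrollment audio needed)."""

    def __init__(self, feat_dim: int = 256, spk_dim: int = 128):
        super().__init__()
        self.spk_pool = nn.Linear(feat_dim, spk_dim)
        self.attn = nn.MultiheadAttention(feat_dim + spk_dim, num_heads=4,
                                          batch_first=True)
        self.mask = nn.Linear(feat_dim + spk_dim, feat_dim)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (B, T, feat_dim) spectral features
        spk = self.spk_pool(noisy.mean(dim=1))          # utterance-level vector
        spk = spk.unsqueeze(1).expand(-1, noisy.size(1), -1)
        x = torch.cat([noisy, spk], dim=-1)
        h, _ = self.attn(x, x, x)                       # multi-head self-attention
        return torch.sigmoid(self.mask(h)) * noisy      # masked enhancement


enhanced = SelfAdaptiveEnhancer()(torch.randn(1, 300, 256).abs())
```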
arXiv Detail & Related papers (2020-02-14T05:05:36Z)