F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder
- URL: http://arxiv.org/abs/2004.07370v1
- Date: Wed, 15 Apr 2020 22:00:06 GMT
- Title: F0-consistent many-to-many non-parallel voice conversion via conditional
autoencoder
- Authors: Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, Gautham J. Mysore
- Abstract summary: We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
- Score: 53.901873501494606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-parallel many-to-many voice conversion remains an interesting but
challenging speech processing task. Many style-transfer-inspired methods such
as generative adversarial networks (GANs) and variational autoencoders (VAEs)
have been proposed. Recently, AutoVC, a conditional autoencoders (CAEs) based
method achieved state-of-the-art results by disentangling the speaker identity
and speech content using information-constraining bottlenecks, and it achieves
zero-shot conversion by swapping in a different speaker's identity embedding to
synthesize a new voice. However, we found that while speaker identity is
disentangled from speech content, a significant amount of prosodic information,
such as source F0, leaks through the bottleneck, causing target F0 to fluctuate
unnaturally. Furthermore, AutoVC has no control of the converted F0 and thus
unsuitable for many applications. In the paper, we modified and improved
autoencoder-based voice conversion to disentangle content, F0, and speaker
identity at the same time. Therefore, we can control the F0 contour, generate
speech with F0 consistent with the target speaker, and significantly improve
quality and similarity. We support our improvement through quantitative and
qualitative analysis.
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion [17.83563578034567]
We propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network.
We show the effectiveness of the proposed method by objective and subjective evaluations.
arXiv Detail & Related papers (2022-10-20T07:30:07Z) - Robust Disentangled Variational Speech Representation Learning for
Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e. voice naturalness and similarity, and remains to be robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z) - Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention
VAE [8.144263449781967]
Variational auto-encoder(VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we found a suitable location of VAE's decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z) - Many-to-Many Voice Conversion based Feature Disentanglement using
Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many to many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z) - VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z) - VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net
architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z) - Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.