NoiseVC: Towards High Quality Zero-Shot Voice Conversion
- URL: http://arxiv.org/abs/2104.06074v1
- Date: Tue, 13 Apr 2021 10:12:38 GMT
- Title: NoiseVC: Towards High Quality Zero-Shot Voice Conversion
- Authors: Shijun Wang and Damian Borth
- Abstract summary: NoiseVC is an approach that disentangles content based on Vector Quantization (VQ) and Contrastive Predictive Coding (CPC).
We conduct several experiments and demonstrate that NoiseVC has strong disentanglement ability with only a small sacrifice in quality.
- Score: 2.3224617218247126
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Voice conversion (VC) is a task that transfers the voice of a target speaker onto source audio without losing linguistic content; it is especially challenging when source and target speakers are unseen during training (zero-shot VC). Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance Normalization (IN) are able to disentangle content from audio and achieve successful conversions. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so sound quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that disentangles content based on VQ and Contrastive Predictive Coding (CPC). Additionally, Noise Augmentation is performed to further enhance disentanglement. We conduct several experiments and demonstrate that NoiseVC has strong disentanglement ability with only a small sacrifice in quality.
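The mechanism in the abstract, a vector-quantized content bottleneck whose code selection is perturbed during training, can be illustrated with a short sketch. This is a minimal illustration under our own assumptions: the module name, dimensions, and the exact way noise enters the quantizer are ours, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyVQBottleneck(nn.Module):
    """Hypothetical VQ content bottleneck with noise augmentation.

    During training, Gaussian noise is added to the code-distance scores so
    that content codes must stay robust to perturbation -- one plausible
    reading of "Noise Augmentation", not the paper's exact formulation.
    """

    def __init__(self, num_codes=64, dim=128, noise_std=0.1):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.noise_std = noise_std

    def forward(self, z):  # z: (batch, time, dim) encoder content features
        # L2 distance between each frame and every codebook entry.
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, codes)  # (batch, time, num_codes)
        if self.training and self.noise_std > 0:
            dist = dist + self.noise_std * torch.randn_like(dist)  # noise augmentation
        idx = dist.argmin(dim=-1)                # (batch, time) code indices
        q = self.codebook(idx)                   # quantized content features
        q_st = z + (q - z).detach()              # straight-through estimator
        commit_loss = F.mse_loss(z, q.detach())  # commitment term
        return q_st, idx, commit_loss

# Usage: quantize encoder output before feeding the decoder.
z = torch.randn(2, 100, 128)
vq = NoisyVQBottleneck()
q, idx, loss = vq(z)
```

The straight-through estimator is the standard trick that lets gradients pass through the non-differentiable code lookup back into the encoder.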
Related papers
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
Takin-VC is a novel zero-shot VC framework that jointly combines hybrid content modeling with memory-augmented, context-aware timbre modeling.
Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances (a minimal decoder sketch follows this entry).
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
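The conversion step described in the entry above, feeding an arbitrary speaker embedding together with per-frame content embeddings to a sequential VAE decoder, might look roughly like the following sketch (architecture, dimensions, and names are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class SeqVAEDecoder(nn.Module):
    """Hypothetical sequential VAE decoder for zero-shot conversion:
    each frame is decoded from concatenated [content_t ; speaker] features."""

    def __init__(self, content_dim=64, speaker_dim=64, mel_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, content, speaker):
        # content: (batch, time, content_dim); speaker: (batch, speaker_dim)
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, spk], dim=-1))
        return self.out(h)  # predicted mel spectrogram

# Zero-shot conversion: content embeddings from the source utterance,
# speaker embedding from an arbitrary (possibly unseen) target speaker.
decoder = SeqVAEDecoder()
mel = decoder(torch.randn(1, 120, 64), torch.randn(1, 64))
```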
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training.
It is difficult to collect clean utterances of a speaker; they are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study of the robustness of any-to-any voice conversion to such degradations.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (an MI-estimator sketch follows this entry).
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
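The MI-based correlation metric in the entry above needs a trainable estimator; one common choice is a CLUB-style variational upper bound (Cheng et al., 2020), sketched below under our own assumptions (the paper's exact estimator and network shapes may differ):

```python
import torch
import torch.nn as nn

class CLUB(nn.Module):
    """CLUB mutual-information upper bound, used here as an illustrative
    stand-in for the MI correlation metric. A variational network q(y|x)
    is trained by maximum likelihood; the MI estimate is the gap between
    matched-pair and all-pair Gaussian log-likelihoods (constants dropped)."""

    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def loglik(self, x, y):
        # Maximize this to fit q(y|x) before using the bound.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, x, y):
        mu, logvar = self.mu(x), self.logvar(x)
        pos = (-((y - mu) ** 2) / logvar.exp()).sum(dim=1)  # matched pairs
        neg = (-((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2)
               / logvar.exp().unsqueeze(1)).sum(dim=2).mean(dim=1)  # all pairs
        return (pos - neg).mean()

# Training loop idea: maximize loglik(speaker_emb, content_emb) to fit the
# estimator, then add mi_upper_bound(...) to the VC loss so the encoders
# are penalized for sharing information.
```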
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder (a shape-level sketch follows this entry).
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
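The four-encoder/one-decoder layout described above can be summarized structurally. This is only a shape-level sketch: all module internals, dimensions, and the pooling of the speaker encoding are assumptions of ours.

```python
import torch
import torch.nn as nn

class VoicyLikeVC(nn.Module):
    """Shape-level sketch of a four-encoder / one-decoder VC model."""

    def __init__(self, mel_dim=80, emb=128):
        super().__init__()
        enc = lambda: nn.GRU(mel_dim, emb, batch_first=True)
        self.speaker_enc, self.content_enc = enc(), enc()
        self.phonetic_enc, self.asr_enc = enc(), enc()
        self.decoder = nn.GRU(4 * emb, mel_dim, batch_first=True)

    def forward(self, source_mel, target_mel):
        # Speaker identity from the target; everything else from the (noisy) source.
        spk, _ = self.speaker_enc(target_mel)
        spk = spk.mean(dim=1, keepdim=True).expand(-1, source_mel.size(1), -1)
        cnt, _ = self.content_enc(source_mel)
        phn, _ = self.phonetic_enc(source_mel)
        asr, _ = self.asr_enc(source_mel)
        out, _ = self.decoder(torch.cat([spk, cnt, phn, asr], dim=-1))
        return out  # clean converted mel, in the spirit of a denoising auto-encoder

model = VoicyLikeVC()
mel = model(torch.randn(1, 120, 80), torch.randn(1, 90, 80))
```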
- StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts [32.170748231414365]
To be useful in a wider range of contexts, voice conversion systems need to be trainable without access to parallel data.
This paper extends recent voice conversion models based on generative adversarial networks (GANs).
We show that real-time zero-shot voice conversion is possible even for a model trained on very little data.
arXiv Detail & Related papers (2021-05-31T18:21:28Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes a corrupted mel spectrogram and its corresponding step information as input to predict the added Gaussian noise (a minimal training-step sketch follows this entry).
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
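The denoising module described above follows the standard diffusion training recipe: corrupt a mel spectrogram with Gaussian noise at a randomly sampled step, then train the network to predict that noise. A minimal sketch of one such training step; the denoiser architecture, step count, and schedule values are assumptions, not DiffSVC's actual configuration.

```python
import torch
import torch.nn as nn

T = 100                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.06, T)      # noise schedule (assumed values)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_training_step(denoiser: nn.Module, mel: torch.Tensor):
    """One DDPM-style step: corrupt `mel`, predict the injected noise."""
    t = torch.randint(0, T, (mel.size(0),))            # random step per sample
    a = alpha_bar[t].view(-1, 1, 1)                    # broadcast over (time, mel)
    noise = torch.randn_like(mel)
    noisy_mel = a.sqrt() * mel + (1.0 - a).sqrt() * noise  # forward process
    pred = denoiser(noisy_mel, t)                      # conditioned on step t
    return nn.functional.mse_loss(pred, noise)         # match the added noise

# Toy denoiser that ignores step conditioning for brevity.
class ToyDenoiser(nn.Module):
    def __init__(self, mel_dim=80):
        super().__init__()
        self.net = nn.Linear(mel_dim, mel_dim)
    def forward(self, x, t):
        return self.net(x)

loss = diffusion_training_step(ToyDenoiser(), torch.randn(4, 120, 80))
```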
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech (a minimal weight-transfer sketch follows this entry).
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
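Parameter transfer of the kind described above usually amounts to loading compatible pretrained weights into the VC network before fine-tuning. A minimal sketch under assumed shapes and a hypothetical checkpoint path (the paper's actual models are seq2seq networks with attention, not this toy pair of GRUs):

```python
import torch
import torch.nn as nn

class Seq2SeqVC(nn.Module):
    """Toy seq2seq VC model whose encoder can be initialized from a
    pretrained ASR encoder with matching shapes (hypothetical setup)."""
    def __init__(self, mel_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(mel_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, mel_dim, batch_first=True)
    def forward(self, mel):
        h, _ = self.encoder(mel)
        out, _ = self.decoder(h)
        return out

vc = Seq2SeqVC()
# Hypothetical checkpoint path; in practice the weights come from
# large-scale ASR (or TTS) training on a matching architecture.
# asr_state = torch.load("asr_encoder.pt")
# vc.encoder.load_state_dict(asr_state)  # reuse pretrained representations
```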
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality (a minimal U-Net sketch follows this entry).
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
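A U-Net inside an auto-encoder VC system adds skip connections around the bottleneck so that fine spectral detail lost in the narrow content representation can be restored at decoding time. A minimal 1-D sketch; the depth, channel counts, and the omission of the VQ layer are simplifications of ours, not the paper's design.

```python
import torch
import torch.nn as nn

class TinyUNetVC(nn.Module):
    """Minimal 1-D U-Net sketch for an auto-encoder VC system: the skip
    connection restores detail that the narrow bottleneck discards."""
    def __init__(self, mel_dim=80):
        super().__init__()
        self.down1 = nn.Conv1d(mel_dim, 128, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(128, 256, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(256, 128, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(256, mel_dim, 4, stride=2, padding=1)

    def forward(self, mel):                 # mel: (batch, mel_dim, time)
        d1 = torch.relu(self.down1(mel))    # (batch, 128, time/2)
        d2 = torch.relu(self.down2(d1))     # (batch, 256, time/4) bottleneck
        u1 = torch.relu(self.up1(d2))       # (batch, 128, time/2)
        u2 = self.up2(torch.cat([u1, d1], dim=1))  # skip connection
        return u2                           # reconstructed mel

out = TinyUNetVC()(torch.randn(1, 80, 128))
```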
This list is automatically generated from the titles and abstracts of the papers in this site.