NoiseVC: Towards High Quality Zero-Shot Voice Conversion
- URL: http://arxiv.org/abs/2104.06074v1
- Date: Tue, 13 Apr 2021 10:12:38 GMT
- Title: NoiseVC: Towards High Quality Zero-Shot Voice Conversion
- Authors: Shijun Wang and Damian Borth
- Abstract summary: NoiseVC is an approach that can disentangle contents based on VQ and Contrastive Predictive Coding (CPC).
We conduct several experiments and demonstrate that NoiseVC has a strong disentanglement ability with a small sacrifice of quality.
- Score: 2.3224617218247126
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract:   Voice conversion (VC) is a task that transforms voice from target audio to
source without losing linguistic contents. It is especially challenging when
source and target speakers are unseen during training (zero-shot VC). Previous
approaches require a pre-trained model or linguistic data to do the zero-shot
conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance
Normalization (IN) are able to disentangle contents from audios and achieve
successful conversions. However, disentanglement in these models depends strongly
on heavily constrained bottleneck layers, so the sound quality is
drastically sacrificed. In this paper, we propose NoiseVC, an approach that can
disentangle contents based on VQ and Contrastive Predictive Coding (CPC).
Additionally, Noise Augmentation is performed to further enhance
disentanglement capability. We conduct several experiments and demonstrate that
NoiseVC has a strong disentanglement ability with a small sacrifice of quality.
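
As a rough illustration of the Vector Quantization (VQ) bottleneck mentioned in the abstract, the following minimal sketch maps each feature frame to its nearest codebook entry. It is not NoiseVC's implementation: the function names, the toy 2-D codebook, and the squared-Euclidean distance are assumptions chosen for illustration, and the CPC loss and Noise Augmentation steps are not shown.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace every frame with its nearest codebook vector.

    Returns the chosen codebook indices and the quantized frames.
    (Hypothetical helper; a trained model would learn the codebook.)
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy example: three 2-D "content" frames and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, quantized = quantize(frames, codebook)
print(indices)
```

Because every frame is collapsed onto one of a small number of code vectors, fine-grained detail is discarded at the bottleneck, which is the kind of constraint the abstract associates with reduced sound quality.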
 
      
Related papers
        - AdaptVC: High Quality Voice Conversion with Adaptive Learning [28.25726543043742]
 Key challenge is to extract disentangled linguistic content from the source and voice style from the reference.
In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters.
The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference.
 arXiv (2025-01-02T16:54:08Z)
- Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling [14.98368067290024]
 Takin-VC is a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling.
 Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems.
 arXiv (2024-10-02T09:07:33Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
 We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
 arXiv (2022-03-30T23:03:19Z)
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
 Any-to-any voice conversion technologies convert the vocal timbre of an utterance to any speaker even unseen during training.
It is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations.
We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion.
 arXiv (2021-10-14T17:00:34Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
 One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
 Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
 arXiv (2021-06-18T13:50:38Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
 Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
 arXiv (2021-06-16T15:47:06Z)
- StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts [32.170748231414365]
 To be useful in a wider range of contexts, voice conversion systems need to be trainable without access to parallel data.
This paper extends recent voice conversion models based on generative adversarial networks (GANs).
We show that real-time zero-shot voice conversion is possible even for a model trained on very little data.
 arXiv (2021-05-31T18:21:28Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
 We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes a corrupted mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
 arXiv (2021-05-28T14:26:40Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
 Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
 arXiv (2020-08-07T11:02:07Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
 Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
 arXiv (2020-06-07T14:01:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     