VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
- URL: http://arxiv.org/abs/2006.04154v1
- Date: Sun, 7 Jun 2020 14:01:16 GMT
- Title: VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
- Authors: Da-Yi Wu, Yen-Hao Chen, Hung-Yi Lee
- Abstract summary: Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
- Score: 71.45920122349628
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Voice conversion (VC) is a task that transforms the source speaker's timbre,
accent, and tones in audio into another speaker's while preserving the linguistic
content. It remains a challenging task, especially in the one-shot setting.
Auto-encoder-based VC methods disentangle the speaker and the content in input
speech without being given the speaker's identity, so these methods can further
generalize to unseen speakers. The disentanglement capability is achieved by vector
quantization (VQ), adversarial training, or instance normalization (IN).
However, imperfect disentanglement may harm the quality of the output speech.
In this work, to further improve audio quality, we use the U-Net architecture
within an auto-encoder-based VC system. We find that to leverage the U-Net
architecture, a strong information bottleneck is necessary. The VQ-based
method, which quantizes the latent vectors, serves this purpose. Objective
and subjective evaluations show that the proposed method performs well in
both audio naturalness and speaker similarity.
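The information bottleneck the abstract describes can be illustrated with a minimal sketch: each frame-level latent vector from the encoder is snapped to its nearest entry in a discrete codebook, so only the code identity (not the exact continuous vector) passes through the bottleneck. This is a toy numpy illustration of nearest-neighbor vector quantization, not the paper's implementation; the function name, shapes, and codebook size are all illustrative assumptions.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance).

    latents:  (T, D) frame-level content vectors from an encoder
    codebook: (K, D) code vectors (learned jointly in a real VQ model)
    Returns the quantized latents and the chosen code indices.
    """
    # Pairwise squared distances between latents and codes: shape (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # nearest code per frame
    quantized = codebook[indices]    # discrete bottleneck output
    return quantized, indices

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))   # K=8 codes of dimension D=4
latents = rng.standard_normal((5, 4))    # T=5 encoder frames
q, idx = vector_quantize(latents, codebook)
# Every quantized frame is exactly one of the 8 codebook rows, so
# speaker-specific continuous detail is discarded at the bottleneck.
```

In a trained system the codebook is optimized with the encoder and decoder (with a straight-through gradient estimator), and the U-Net skip connections reintroduce fine detail that the quantized path discards.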
Related papers
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- NVC-Net: End-to-End Adversarial Voice Conversion [7.14505983271756]
NVC-Net is an end-to-end adversarial network that performs voice conversion directly on the raw audio waveform of arbitrary length.
Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods.
arXiv Detail & Related papers (2021-06-02T07:19:58Z)
- NoiseVC: Towards High Quality Zero-Shot Voice Conversion [2.3224617218247126]
NoiseVC is an approach that can disentangle contents based on VQ and Contrastive Predictive Coding (CPC).
We conduct several experiments and demonstrate that NoiseVC has a strong disentanglement ability with a small sacrifice of quality.
arXiv Detail & Related papers (2021-04-13T10:12:38Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.