StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts
- URL: http://arxiv.org/abs/2106.00043v1
- Date: Mon, 31 May 2021 18:21:28 GMT
- Title: StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts
- Authors: Matthew Baas, Herman Kamper
- Abstract summary: To be useful in a wider range of contexts, voice conversion systems need to be trainable without access to parallel data.
This paper extends recent voice conversion models based on generative adversarial networks (GANs).
We show that real-time zero-shot voice conversion is possible even for a model trained on very little data.
- Score: 32.170748231414365
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Voice conversion is the task of converting a spoken utterance from a source
speaker so that it appears to be said by a different target speaker while
retaining the linguistic content of the utterance. Recent advances have led to
major improvements in the quality of voice conversion systems. However, to be
useful in a wider range of contexts, voice conversion systems would need to (i)
be trainable without access to parallel data, (ii) work in a zero-shot setting
where both the source and target speakers are unseen during training, and (iii)
run in real time or faster. Recent techniques fulfil one or two of these
requirements, but not all three. This paper extends recent voice conversion
models based on generative adversarial networks (GANs), to satisfy all three of
these conditions. We specifically extend the recent StarGAN-VC model by
conditioning it on a speaker embedding (from a potentially unseen speaker).
This allows the model to be used in a zero-shot setting, and we therefore call
it StarGAN-ZSVC. We compare StarGAN-ZSVC against other voice conversion
techniques in a low-resource setting using a small 9-minute training set.
Compared to AutoVC -- another recent neural zero-shot approach -- we observe
that StarGAN-ZSVC gives small improvements in the zero-shot setting, showing
that real-time zero-shot voice conversion is possible even for a model trained
on very little data. Further work is required to see whether scaling up
StarGAN-ZSVC will also improve zero-shot voice conversion quality in
high-resource contexts.
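The architectural change behind the zero-shot ability is compact: where StarGAN-VC conditions its generator on a fixed speaker label, StarGAN-ZSVC conditions it on a speaker embedding, so an unseen speaker can be targeted simply by computing a new embedding vector. The snippet below is a minimal PyTorch sketch of this conditioning idea only; the frame-wise MLP, feature sizes, and embedding dimension are illustrative assumptions rather than the paper's actual architecture.

```python
# Minimal sketch of a GAN generator conditioned on a continuous speaker
# embedding instead of a one-hot speaker label. Layer sizes are assumptions,
# not StarGAN-ZSVC's actual architecture.
import torch
import torch.nn as nn

class SpeakerConditionedGenerator(nn.Module):
    def __init__(self, n_mels: int = 80, spk_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, src: torch.Tensor, tgt_emb: torch.Tensor) -> torch.Tensor:
        # src: (batch, time, n_mels) source spectrogram frames.
        # tgt_emb: (batch, spk_dim) target-speaker embedding, e.g. from some
        # speaker encoder; an unseen speaker just means a new embedding
        # vector, which is what makes zero-shot conversion possible.
        cond = tgt_emb.unsqueeze(1).expand(-1, src.size(1), -1)
        return self.net(torch.cat([src, cond], dim=-1))
```

In a full system this generator would be trained with the usual StarGAN adversarial and cycle-consistency objectives; the zero-shot behaviour comes entirely from swapping the fixed speaker code for an embedding that can be computed for any speaker.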
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation [41.98697872087318]
We introduce Diff-HierVC, a hierarchical VC system based on two diffusion models.
Our model achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
arXiv Detail & Related papers (2023-11-08T14:02:53Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
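To make the HiFi-VC entry above concrete, here is a hypothetical skeleton of an ASR-feature-based any-to-any pipeline. Every sub-module is a simple placeholder (the names AnyToAnyVC, content_encoder, speaker_encoder, and decoder are invented for illustration); only the overall data flow, ASR-style content features plus a pitch track plus a target-speaker embedding feeding a prediction model, reflects the summary.

```python
# Hypothetical skeleton of an ASR-feature-based any-to-any VC pipeline in the
# spirit of HiFi-VC. All three sub-modules are simple stand-ins, not the
# paper's models; only the data flow matches the description above.
import torch
import torch.nn as nn

class AnyToAnyVC(nn.Module):
    def __init__(self, content_dim: int = 256, spk_dim: int = 192, n_mels: int = 80):
        super().__init__()
        # Stand-in for an ASR-derived content encoder (speaker-independent features).
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        # Stand-in speaker encoder: one embedding per target-speaker reference.
        self.speaker_encoder = nn.Linear(n_mels, spk_dim)
        # Stand-in for the waveform/spectrogram prediction model.
        self.decoder = nn.Linear(content_dim + spk_dim + 1, n_mels)

    def forward(self, src_mels, tgt_mels, f0):
        # src_mels: (B, T, n_mels) source utterance (content comes from here).
        # tgt_mels: (B, T2, n_mels) reference utterance from the target speaker.
        # f0:       (B, T, 1) pitch track of the source utterance.
        content, _ = self.content_encoder(src_mels)
        spk = self.speaker_encoder(tgt_mels.mean(dim=1))
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, spk, f0], dim=-1))
```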
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised zero-shot voice conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition [23.75478998795749]
We propose the use of automatic speech recognition to assist model training.
We show that using our proposed method, StarGAN-VC can retain more linguistic information than vanilla StarGAN-VC.
arXiv Detail & Related papers (2021-08-10T01:18:31Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results show that the proposed method learns effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
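The vector-quantization step named in the VQMIVC entry above is simple to sketch. Below is a toy PyTorch implementation of nearest-neighbour codebook quantization with a straight-through gradient; the codebook size and feature dimension are arbitrary assumptions, and the paper's mutual-information term is omitted entirely.

```python
# Toy vector quantizer: each content frame is snapped to its nearest
# codebook entry, with gradients passed straight through to the input.
# Sizes are arbitrary assumptions, not VQMIVC's settings.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, T, dim) continuous content features from the content encoder.
        # Squared distance from every frame to every code: (B, T, num_codes).
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)              # nearest code per frame
        q = self.codebook(idx)              # quantized (discrete) content
        # Straight-through estimator: forward pass uses q, gradients flow to z.
        return z + (q - z).detach(), idx
```

Because each frame is forced onto a small discrete codebook, speaker-specific detail is hard to carry through this bottleneck, which is why VQ is a natural choice for content encoding.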
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in the input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
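The U-Net idea in the VQVC+ entry above is worth a small illustration: skip connections let fine spectral detail route around the auto-encoder bottleneck, which is what improves audio quality. The sketch below is a generic miniature U-Net over mel-spectrograms, assumed purely for illustration and not VQVC+'s actual architecture.

```python
# Miniature U-Net-style auto-encoder over mel-spectrograms, showing how skip
# connections carry fine detail past the bottleneck. A generic illustration
# only; VQVC+'s real architecture differs.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.down1 = nn.Conv1d(n_mels, 128, kernel_size=4, stride=2, padding=1)
        self.down2 = nn.Conv1d(128, 256, kernel_size=4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(256, 128, kernel_size=4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(256, n_mels, kernel_size=4, stride=2, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mels, time); time should be divisible by 4.
        h1 = self.act(self.down1(x))      # (B, 128, T/2)
        h2 = self.act(self.down2(h1))     # (B, 256, T/4) -- the bottleneck
        u1 = self.act(self.up1(h2))       # (B, 128, T/2)
        u1 = torch.cat([u1, h1], dim=1)   # skip connection: (B, 256, T/2)
        return self.up2(u1)               # (B, n_mels, T)
```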