Many-to-Many Voice Conversion using Conditional Cycle-Consistent
Adversarial Networks
- URL: http://arxiv.org/abs/2002.06328v1
- Date: Sat, 15 Feb 2020 06:03:36 GMT
- Title: Many-to-Many Voice Conversion using Conditional Cycle-Consistent
Adversarial Networks
- Authors: Shindong Lee, BongGu Ko, Keonnyeong Lee, In-Chul Yoo, and Dongsuk Yook
- Abstract summary: We extend the CycleGAN by conditioning the network on speakers.
The proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN).
Compared to building a separate CycleGAN for each pair of speakers, the proposed method reduces the computational and spatial cost significantly without compromising the sound quality of the converted voice.
- Score: 3.1317409221921144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion (VC) refers to transforming the speaker characteristics of an utterance without altering its linguistic content. Many voice conversion methods require parallel training data, which is highly expensive to acquire. Recently, the cycle-consistent adversarial network (CycleGAN), which does not require parallel training data, has been applied to voice conversion, showing state-of-the-art performance. CycleGAN-based voice conversion, however, works only for a single pair of speakers, i.e., one-to-one conversion between two speakers. In this paper, we extend the CycleGAN by conditioning the network on speakers. As a result, the proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN). Compared to building a separate CycleGAN for each pair of speakers, the proposed method significantly reduces the computational and spatial cost without compromising the sound quality of the converted voice. Experimental results using the VCC2018 corpus confirm the efficiency of the proposed method.
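To make the core idea concrete, here is a minimal sketch of a speaker-conditioned CycleGAN-style generator with a cycle-consistency term. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a speaker-conditioned CycleGAN-style generator (PyTorch).
# Module names, dimensions, and losses are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    """Maps source acoustic features to a target speaker's features,
    conditioned on a learned embedding of the target speaker."""
    def __init__(self, feat_dim=36, num_speakers=4, emb_dim=16):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x, speaker_id):
        # x: (batch, frames, feat_dim); speaker_id: (batch,) integer labels
        emb = self.speaker_emb(speaker_id)                # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, x.size(1), -1)  # repeat per frame
        return self.net(torch.cat([x, emb], dim=-1))

# One generator covers all speaker pairs: convert source -> target,
# then back to the source speaker, and penalize the reconstruction error.
G = ConditionalGenerator()
x = torch.randn(8, 128, 36)                      # batch of source features
src = torch.zeros(8, dtype=torch.long)           # source speaker ids
tgt = torch.ones(8, dtype=torch.long)            # target speaker ids
x_fake = G(x, tgt)                               # rendered as the target speaker
x_cycle = G(x_fake, src)                         # converted back to the source
cycle_loss = F.l1_loss(x_cycle, x)               # cycle-consistency term
```

A discriminator conditioned on the same speaker labels would supply the adversarial term; the cycle-consistency loss is what lets the network train without parallel utterances.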
Related papers
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
Our experiments show that using a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves over the CNN baseline by 3.12 dB.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
- Who is Authentic Speaker [4.822108779108675]
Voice conversion can pose social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a major challenge, as the acoustic characteristics of the source speakers are greatly altered.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
arXiv Detail & Related papers (2024-04-30T23:41:00Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method can disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC extracts fine-grained voice fragments from the target speaker's utterance(s) and fuses them into the desired utterance (a minimal cross-attention sketch appears after this list).
This approach is trained with reconstruction loss only, without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [81.79070894458322]
Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody at different time resolutions (a minimal illustrative sketch of such a decomposition appears after this list).
arXiv Detail & Related papers (2020-08-11T07:29:55Z)
- MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS [Li et al., 2019], FastSpeech [Ren et al., 2019]) have shown advantages in training and inference efficiency over RNN-based models.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices over multiple processing steps, while keeping the speaker assigned to each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
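As a rough illustration of the attention-based fusion described in the FragmentVC entry above: a content representation of the source utterance attends over frames of a target-speaker utterance, so each output frame is assembled from fragments of the target's voice. The dimensions and the use of nn.MultiheadAttention are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of FragmentVC-style attention fusion (PyTorch).
# Dimensions and module choices are assumptions, not the paper's code.
import torch
import torch.nn as nn

embed_dim = 256
attn = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=4, batch_first=True)

src_content = torch.randn(1, 120, embed_dim)  # source utterance content features
                                              # (e.g., projected Wav2Vec 2.0 outputs)
tgt_frames = torch.randn(1, 300, embed_dim)   # target-speaker utterance features

# Each source frame queries the target utterance and is rebuilt as a
# weighted mix of target "fragments"; a decoder would map this to speech.
fused, attn_weights = attn(query=src_content, key=tgt_frames, value=tgt_frames)
print(fused.shape)  # torch.Size([1, 120, 256])
```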
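And for the CWT-based F0 modeling mentioned in the Spectrum and Prosody Conversion entry: the sketch below decomposes a toy log-F0 contour into ten temporal scales with PyWavelets. The wavelet choice, scales, and frame shift are assumptions, not the paper's configuration.

```python
# Illustrative CWT decomposition of an F0 contour (PyWavelets).
# Wavelet, scales, and frame shift are assumptions, not the paper's setup.
import numpy as np
import pywt

frame_shift = 0.005                            # 5 ms analysis frames (assumed)
t = np.arange(0, 2.0, frame_shift)
f0 = 120 + 20 * np.sin(2 * np.pi * 0.8 * t)    # toy voiced F0 contour in Hz

# Normalize log-F0 before decomposition, a common preprocessing step.
lf0 = np.log(f0)
lf0 = (lf0 - lf0.mean()) / lf0.std()

# Dyadic scales: small scales capture fast, syllable-level F0 movement,
# large scales capture slow, phrase-level trends.
scales = 2.0 ** np.arange(1, 11)
coeffs, freqs = pywt.cwt(lf0, scales, "mexh", sampling_period=frame_shift)

print(coeffs.shape)  # (10, len(t)): one time series per temporal scale
```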
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.