StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for
Natural-Sounding Voice Conversion
- URL: http://arxiv.org/abs/2107.10394v2
- Date: Fri, 23 Jul 2021 01:08:09 GMT
- Title: StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for
Natural-Sounding Voice Conversion
- Authors: Yinghao Aaron Li, Ali Zare, Nima Mesgarani
- Abstract summary: We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present an unsupervised non-parallel many-to-many voice conversion (VC)
method using a generative adversarial network (GAN) called StarGAN v2. Using a
combination of adversarial source classifier loss and perceptual loss, our
model significantly outperforms previous VC models. Although our model is
trained only with 20 English speakers, it generalizes to a variety of voice
conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
Using a style encoder, our framework can also convert plain reading speech into
stylistic speech, such as emotional and falsetto speech. Subjective and
objective evaluation experiments on a non-parallel many-to-many voice
conversion task revealed that our model produces natural-sounding voices, close
to the sound quality of state-of-the-art text-to-speech (TTS) based voice
conversion methods, without the need for text labels. Moreover, our model is
completely convolutional and, when paired with a faster-than-real-time vocoder
such as Parallel WaveGAN, can perform real-time voice conversion.