Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
- URL: http://arxiv.org/abs/2008.04562v3
- Date: Tue, 3 Nov 2020 16:34:35 GMT
- Title: Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
- Authors: Zongyang Du, Kun Zhou, Berrak Sisman, Haizhou Li
- Abstract summary: Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that describe prosody at different time resolutions.
- Score: 81.79070894458322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so a linear method alone is insufficient for its conversion. We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that describe prosody at different time resolutions. We also propose to train two CycleGAN pipelines for spectrum and prosody mapping, respectively. In this way, we eliminate the need for parallel data between any two languages and for any alignment techniques. Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To the best of our knowledge, this is the first study of prosody in cross-lingual voice conversion.
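To make the CWT-based F0 modeling step concrete, below is a minimal NumPy sketch that decomposes a log-F0 contour into ten temporal scales. It assumes a Mexican hat (Ricker) mother wavelet and dyadic scale spacing, which are common choices in CWT-based prosody work; the function names, scale settings, and preprocessing are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_decompose_f0(log_f0, num_scales=10, base_scale=2.0):
    """Decompose a normalized log-F0 contour into `num_scales`
    temporal scales via a continuous wavelet transform.
    Returns an array of shape (num_scales, len(log_f0))."""
    n = len(log_f0)
    components = np.zeros((num_scales, n))
    for i in range(num_scales):
        scale = base_scale ** (i + 1)  # dyadic spacing between scales
        # Sample the wavelet over a window proportional to the scale.
        width = int(min(10 * scale, n))
        t = (np.arange(width) - width // 2) / scale
        wavelet = mexican_hat(t) / np.sqrt(scale)
        components[i] = np.convolve(log_f0, wavelet, mode='same')
    return components

# Usage: F0 is typically interpolated through unvoiced frames,
# log-transformed, and z-normalized per utterance before the CWT.
f0 = 120.0 + 20.0 * np.sin(np.linspace(0, 8 * np.pi, 400))  # toy contour
log_f0 = (np.log(f0) - np.log(f0).mean()) / np.log(f0).std()
scales = cwt_decompose_f0(log_f0)
print(scales.shape)  # (10, 400)
```

Coarse scales capture phrase-level intonation while fine scales capture syllable-level movements, which is what makes a scale-wise mapping richer than a single linear F0 transform.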
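The dual-pipeline training objective can likewise be sketched. Below is a hedged PyTorch illustration of two independent CycleGANs, one over spectral features and one over CWT-decomposed F0, each trained with least-squares adversarial, cycle-consistency, and identity losses. The toy networks, feature dimensions, and loss weights are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Placeholder 1-D conv generator; the paper's architecture differs."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, dim, 5, padding=2))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Placeholder per-frame (PatchGAN-style) discriminator."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, 64, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 5, padding=2))
    def forward(self, x):
        return self.net(x)

def generator_objective(G_xy, G_yx, D_y, x, y, lam_cyc=10.0, lam_id=5.0):
    """Generator-side losses for one translation direction (LSGAN form).
    The discriminator is trained separately with real/fake targets."""
    fake_y = G_xy(x)
    # Adversarial loss: make D_y output 1 on converted features.
    adv = ((D_y(fake_y) - 1.0) ** 2).mean()
    # Cycle-consistency: x -> y -> x should reconstruct x.
    cyc = (G_yx(fake_y) - x).abs().mean()
    # Identity mapping: a real y passed through G_xy should change little.
    idt = (G_xy(y) - y).abs().mean()
    return adv + lam_cyc * cyc + lam_id * idt

# Two independent pipelines: one maps spectra, the other CWT-F0 scales.
spec_dim, f0_dim, T = 24, 10, 128
G_spec, G_spec_inv = Generator(spec_dim), Generator(spec_dim)
G_f0, G_f0_inv = Generator(f0_dim), Generator(f0_dim)
D_spec, D_f0 = Discriminator(spec_dim), Discriminator(f0_dim)

src_spec, tgt_spec = torch.randn(8, spec_dim, T), torch.randn(8, spec_dim, T)
src_f0, tgt_f0 = torch.randn(8, f0_dim, T), torch.randn(8, f0_dim, T)
loss_spec = generator_objective(G_spec, G_spec_inv, D_spec, src_spec, tgt_spec)
loss_f0 = generator_objective(G_f0, G_f0_inv, D_f0, src_f0, tgt_f0)
print(loss_spec.item(), loss_f0.item())
```

Because the cycle-consistency term supervises reconstruction rather than paired targets, neither pipeline needs parallel utterances or frame alignment across the two languages.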
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only timbre while keeping the original content and source-language prosody, without requiring multi-lingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained only with 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech [91.92456020841438]
We study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN).
We propose a speaker-dependent EVC framework that includes two VAW-GAN pipelines, one for spectrum conversion and another for prosody conversion.
Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
arXiv Detail & Related papers (2020-11-03T08:49:33Z)
- Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion [0.0]
Cross-lingual voice conversion aims to synthesize the target speaker's voice with the same content when the source and target speakers speak different languages.
In this paper, we focus on knowledge transfer from monolingual ASR to cross-lingual VC.
We successfully address cross-lingual VC without any transcription or language-specific knowledge for foreign speech.
arXiv Detail & Related papers (2020-09-30T13:44:35Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features, such as Mel-frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
- Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks [3.1317409221921144]
We extend the CycleGAN by conditioning the network on speakers.
The proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN).
Compared to building multiple CycleGANs for each pair of speakers, the proposed method reduces the computational and spatial cost significantly without compromising the sound quality of the converted voice.
arXiv Detail & Related papers (2020-02-15T06:03:36Z)
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)