Transforming Spectrum and Prosody for Emotional Voice Conversion with
Non-Parallel Training Data
- URL: http://arxiv.org/abs/2002.00198v5
- Date: Sat, 24 Oct 2020 06:37:42 GMT
- Title: Transforming Spectrum and Prosody for Emotional Voice Conversion with
Non-Parallel Training Data
- Authors: Kun Zhou, Berrak Sisman, Haizhou Li
- Abstract summary: Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions.
- Score: 91.92456020841438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional voice conversion aims to convert the spectrum and prosody to change
the emotional patterns of speech, while preserving the speaker identity and
linguistic content. Many studies require parallel speech data between different
emotional patterns, which is not practical in real life. Moreover, they often
model the conversion of fundamental frequency (F0) with a simple linear
transform. As F0 is a key aspect of intonation that is hierarchical in nature,
we believe that it is more appropriate to model F0 at different temporal scales
using the wavelet transform. We propose a CycleGAN network to find an optimal
pseudo pair from non-parallel training data by learning forward and inverse
mappings simultaneously using adversarial and cycle-consistency losses. We also
study the use of the continuous wavelet transform (CWT) to decompose F0 into ten
temporal scales that describe speech prosody at different time resolutions, for
effective F0 conversion. Experimental results show that our proposed framework
outperforms the baselines in both objective and subjective evaluations.
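As a rough illustration of the adversarial plus cycle-consistency objective described in the abstract, the sketch below combines the two losses for a pair of forward/inverse generators in PyTorch. The module names, the least-squares GAN formulation, the identity-mapping term, and the loss weights are assumptions for illustration, not the paper's published configuration.

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G_AB, G_BA, D_A, D_B, x_a, x_b,
                            lam_cyc=10.0, lam_id=5.0):
    """Hypothetical generator-side objective for non-parallel emotion conversion."""
    # Forward and inverse mappings between emotion domains A and B.
    fake_b = G_AB(x_a)          # e.g. neutral -> emotional
    fake_a = G_BA(x_b)          # emotional -> neutral

    # Adversarial terms (least-squares GAN formulation, generator side).
    pred_b = D_B(fake_b)
    pred_a = D_A(fake_a)
    adv = F.mse_loss(pred_b, torch.ones_like(pred_b)) + \
          F.mse_loss(pred_a, torch.ones_like(pred_a))

    # Cycle-consistency: mapping forward and back should reconstruct the input,
    # which is what lets training work without parallel (paired) utterances.
    cyc = F.l1_loss(G_BA(fake_b), x_a) + F.l1_loss(G_AB(fake_a), x_b)

    # Identity-mapping term, commonly added to help preserve linguistic content.
    idt = F.l1_loss(G_AB(x_b), x_b) + F.l1_loss(G_BA(x_a), x_a)

    return adv + lam_cyc * cyc + lam_id * idt
```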
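The ten-scale CWT decomposition of F0 can be sketched as follows with PyWavelets. The Mexican-hat mother wavelet, the half-octave scale spacing, and the interpolation of unvoiced frames follow common prosody-modelling practice; they are assumptions here rather than details taken from the paper.

```python
import numpy as np
import pywt  # PyWavelets

def decompose_f0_cwt(f0, num_scales=10, tau0=5.0):
    """Decompose an F0 contour (in Hz, 0 for unvoiced frames) into
    `num_scales` temporal scales with a continuous wavelet transform.
    Scale spacing and preprocessing are illustrative assumptions."""
    # Interpolate unvoiced frames, move to the log domain, and z-normalize.
    voiced = f0 > 0
    f0_interp = np.interp(np.arange(len(f0)),
                          np.flatnonzero(voiced), f0[voiced])
    lf0 = np.log(f0_interp)
    mean, std = lf0.mean(), lf0.std()
    lf0 = (lf0 - mean) / std

    # Ten scales spaced half an octave apart, starting from tau0 frames.
    scales = tau0 * 2.0 ** (0.5 * np.arange(num_scales))

    # Mexican-hat CWT; coeffs has shape (num_scales, num_frames).
    coeffs, _ = pywt.cwt(lf0, scales, 'mexh')
    return coeffs, mean, std
```

Each row of `coeffs` captures F0 movement at one time resolution (fast, syllable-level variation in the small scales, slow phrase-level variation in the large scales), so a conversion model can transform them separately before re-synthesizing the contour.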
Related papers
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency-domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion [17.83563578034567]
We propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network.
We show the effectiveness of the proposed method by objective and subjective evaluations.
arXiv Detail & Related papers (2022-10-20T07:30:07Z)
- Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels [11.92436948211501]
A single neural network is proposed, in which a first module is used to learn F0 representation over different temporal scales.
A second adversarial module is used to learn the transformation from one emotion to another.
arXiv Detail & Related papers (2021-04-15T07:42:59Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- WaveTransform: Crafting Adversarial Examples via Input Decomposition [69.01794414018603]
We introduce WaveTransform, which creates adversarial noise corresponding to low-frequency and high-frequency subbands, separately or in combination.
Experiments show that the proposed attack is effective against the defense algorithm and is also transferable across CNNs.
arXiv Detail & Related papers (2020-10-29T17:16:59Z)
- Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN [81.79070894458322]
Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages.
Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer.
We propose the use of continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides a way to decompose a signal into different temporal scales that explain prosody in different time resolutions.
arXiv Detail & Related papers (2020-08-11T07:29:55Z)
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
- Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network [18.275646344620387]
We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture.
We show that our method outperforms the existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of resynthesized speech.
arXiv Detail & Related papers (2020-07-25T13:59:22Z)
- Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator [16.18921154013272]
We introduce a novel method for emotion conversion in speech that does not require parallel training data.
Unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion.
We show that our model generalizes to new speakers by modifying speech produced by WaveNet.
arXiv Detail & Related papers (2020-07-25T13:50:00Z)
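As a minimal, hypothetical sketch of the pair-discriminator idea in the last entry above: rather than scoring a single sample as real or fake, the discriminator scores a (source, converted) pair and decides whether it realizes the desired emotion conversion. The layer sizes, feature dimension, and module names below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores a (source, converted) feature pair instead of a single sample."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),  # logit: does this pair match the target conversion?
        )

    def forward(self, src_frames, conv_frames):
        # src_frames, conv_frames: (batch, feat_dim) per-frame features.
        pair = torch.cat([src_frames, conv_frames], dim=-1)
        return self.net(pair)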
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.