VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
- URL: http://arxiv.org/abs/2008.03992v3
- Date: Tue, 3 Nov 2020 10:58:10 GMT
- Title: VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
- Authors: Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li
- Abstract summary: We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
- Score: 81.79070894458322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singing voice conversion aims to convert a singer's voice from source
to target without changing the singing content. Parallel training data is
typically required to train a singing voice conversion system, which is,
however, not practical in real-life applications. Recent encoder-decoder structures, such as
variational autoencoding Wasserstein generative adversarial network (VAW-GAN),
provide an effective way to learn a mapping through non-parallel training data.
In this paper, we propose a singing voice conversion framework that is based on
VAW-GAN. We train an encoder to disentangle singer identity and singing prosody
(F0 contour) from phonetic content. By conditioning on singer identity and F0,
the decoder generates output spectral features with unseen target singer
identity, and improves the F0 rendering. Experimental results show that the
proposed framework achieves better performance than the baseline frameworks.
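The disentangle-then-condition pipeline described in the abstract can be sketched as below. This is a shape-level illustration only: the dimensions, the random linear maps, and the function names are hypothetical stand-ins for the trained VAW-GAN encoder and decoder, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (all hypothetical; the paper does not specify these sizes).
T, D_SPEC = 100, 80      # frames, spectral feature dim (e.g. mel bins)
D_CONTENT = 64           # latent "phonetic content" code
N_SINGERS = 4            # singers represented as one-hot identity vectors

def encode(spec):
    """Map spectral frames to a content code that (ideally) carries no
    singer identity or F0 information; a random linear stand-in here."""
    W = rng.standard_normal((D_SPEC, D_CONTENT)) * 0.01
    return spec @ W                                   # (T, D_CONTENT)

def decode(content, singer_id, f0):
    """Generate spectra conditioned on singer identity and F0, mirroring
    the conditioning scheme of the paper's decoder."""
    singer = np.eye(N_SINGERS)[singer_id]             # one-hot (N_SINGERS,)
    cond = np.concatenate(
        [content, np.tile(singer, (len(content), 1)), f0[:, None]], axis=1
    )                                                 # (T, D_CONTENT + N_SINGERS + 1)
    W = rng.standard_normal((cond.shape[1], D_SPEC)) * 0.01
    return cond @ W                                   # (T, D_SPEC)

spec = rng.standard_normal((T, D_SPEC))               # source singer's spectral frames
f0 = np.abs(rng.standard_normal(T)) * 200 + 100      # source F0 contour in Hz

content = encode(spec)                                # identity/F0 stripped (in the trained model)
converted = decode(content, singer_id=2, f0=f0)       # swap in target singer, keep the F0 contour
```

Conversion to an unseen target then amounts to re-running `decode` with a different singer condition while reusing the source content code and F0, which is what makes non-parallel training workable.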
Related papers
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
- A Comparative Analysis Of Latent Regressor Losses For Singing Voice Conversion [15.691936529849539]
We train a singer identity embedding (SIE) network on mel-spectrograms of singer recordings to produce singer-specific variance encodings.
We propose a pitch-matching mechanism between source and target singers to ensure these evaluations are not influenced by differences in pitch register.
arXiv Detail & Related papers (2023-02-27T11:26:57Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- PPG-based singing voice conversion with adversarial representation learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model, trained for automatic speech recognition, together with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
- Speech-to-Singing Conversion in an Encoder-Decoder Framework [38.111942306157545]
We take a learning-based approach to the problem of converting spoken lines into sung ones.
We learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker.
arXiv Detail & Related papers (2020-02-16T15:33:41Z)
- Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
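The multi-scale F0 decomposition mentioned above can be sketched with a plain-numpy CWT. The Ricker (Mexican-hat) wavelet, the dyadic scale placement, and the helper names here are my assumptions for illustration; the cited paper's exact wavelet and scale choices may differ.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet of width parameter a, sampled at `points` taps."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_f0(f0, n_scales=10):
    """Decompose an F0 contour into n_scales temporal scales by convolving
    with progressively wider wavelets (coarser scales capture slower prosody)."""
    f0 = (f0 - f0.mean()) / (f0.std() + 1e-8)    # normalize before analysis
    scales = 2.0 ** np.arange(n_scales)           # dyadic scales: 1, 2, 4, ...
    return np.stack([
        np.convolve(f0, ricker(min(10 * int(s), len(f0)), s), mode="same")
        for s in scales
    ])                                            # (n_scales, len(f0))

f0 = 200 + 30 * np.sin(np.linspace(0, 6 * np.pi, 300))  # synthetic F0 contour in Hz
components = cwt_f0(f0)
```

Each row of `components` is the contour seen at one temporal scale, so a conversion model can transform slow phrase-level prosody and fast syllable-level movement separately.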
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.