PPG-based singing voice conversion with adversarial representation
learning
- URL: http://arxiv.org/abs/2010.14804v1
- Date: Wed, 28 Oct 2020 08:03:27 GMT
- Title: PPG-based singing voice conversion with adversarial representation
learning
- Authors: Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen,
Zejun Ma
- Abstract summary: Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
- Score: 18.937609682084034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singing voice conversion (SVC) aims to convert the voice of one singer to
that of other singers while keeping the singing content and melody. On top of
recent voice conversion works, we propose a novel model to steadily convert
songs while keeping their naturalness and intonation. We build an end-to-end
architecture, taking phonetic posteriorgrams (PPGs) as inputs and generating
mel spectrograms. Specifically, we implement two separate encoders: one encodes
PPGs as content, and the other compresses mel spectrograms to supply acoustic
and musical information. To improve the performance on timbre and melody, an
adversarial singer confusion module and a mel-regressive representation
learning module are designed for the model. Objective and subjective
experiments are conducted on our private Chinese singing corpus. Compared with
the baselines, our methods significantly improve conversion performance in
terms of naturalness, melody, and voice similarity. Moreover, our PPG-based
method proves robust to noisy sources.
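The dual-encoder design described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions (218-dim PPGs, 80-bin mels, 64-unit hidden layers), the single-layer tanh encoders, and the frame count are assumptions for brevity. The adversarial singer confusion module acts only at training time (a singer classifier on the content embedding whose gradient is reversed into the content encoder), so it appears here only as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not from the paper):
PPG_DIM, MEL_DIM, HID = 218, 80, 64
T = 100  # number of frames in one utterance

def linear(x, w, b):
    return x @ w + b

# Content encoder: PPG frames -> content embedding
W_c = rng.normal(scale=0.01, size=(PPG_DIM, HID)); b_c = np.zeros(HID)
# Acoustic encoder: mel frames -> acoustic/musical embedding
W_a = rng.normal(scale=0.01, size=(MEL_DIM, HID)); b_a = np.zeros(HID)
# Decoder: concatenated embeddings -> mel frames
W_d = rng.normal(scale=0.01, size=(2 * HID, MEL_DIM)); b_d = np.zeros(MEL_DIM)

ppg = rng.normal(size=(T, PPG_DIM))   # phonetic posteriorgram input
mel = rng.normal(size=(T, MEL_DIM))   # reference mel spectrogram input

content = np.tanh(linear(ppg, W_c, b_c))    # singer-independent content
acoustic = np.tanh(linear(mel, W_a, b_a))   # timbre / musical information
mel_hat = linear(np.concatenate([content, acoustic], axis=1), W_d, b_d)

# During training, an adversarial singer classifier would be applied to
# `content`, with its gradient reversed (gradient-reversal layer) before
# flowing into the content encoder, pushing `content` to be singer-agnostic.
assert mel_hat.shape == (T, MEL_DIM)
```

In this sketch the decoder simply concatenates the two embeddings per frame; the key design point from the abstract is the separation of content (from PPGs) and acoustic/musical information (from mels) into distinct encoders.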
Related papers
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Effects of Convolutional Autoencoder Bottleneck Width on StarGAN-based Singing Technique Conversion [2.2221991003992967]
Singing technique conversion (STC) refers to the task of converting from one voice technique to another.
Previous STC studies, as well as singing voice conversion research in general, have utilized convolutional autoencoders (CAEs) for conversion.
We constructed a GAN-based multi-domain STC system which took advantage of the WORLD vocoder representation and the CAE architecture.
arXiv Detail & Related papers (2023-08-19T14:13:28Z)
- Robust One-Shot Singing Voice Conversion [28.707278256253385]
High-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wide variety of musical expressions in pitch, loudness, and pronunciation.
We present a robust one-shot SVC that performs any-to-any SVC robustly even on distorted singing voices.
Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers.
arXiv Detail & Related papers (2022-10-20T08:47:35Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task: singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG)-based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, style and naturalness of the converted singing voices are still limited.
Our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer.
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
- VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours data from 89 singers on three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
- Speech-to-Singing Conversion in an Encoder-Decoder Framework [38.111942306157545]
We take a learning based approach to the problem of converting spoken lines into sung ones.
We learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker.
arXiv Detail & Related papers (2020-02-16T15:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.