DeepSinger: Singing Voice Synthesis with Data Mined From the Web
- URL: http://arxiv.org/abs/2007.04590v2
- Date: Wed, 15 Jul 2020 14:37:45 GMT
- Title: DeepSinger: Singing Voice Synthesis with Data Mined From the Web
- Authors: Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu
- Abstract summary: DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we develop DeepSinger, a multi-lingual multi-singer singing
voice synthesis (SVS) system, which is built from scratch using singing
training data mined from music websites. The pipeline of DeepSinger consists of
several steps, including data crawling, singing and accompaniment separation,
lyrics-to-singing alignment, data filtration, and singing modeling.
Specifically, we design a lyrics-to-singing alignment model to automatically
extract the duration of each phoneme in the lyrics, progressing from the
coarse-grained sentence level to the fine-grained phoneme level, and further design a
multi-lingual multi-singer singing model based on a feed-forward Transformer to
directly generate linear-spectrograms from lyrics, and synthesize voices using
Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to
the best of our knowledge, it is the first SVS system that directly mines
training data from music websites, 2) the lyrics-to-singing alignment model
avoids any human effort for alignment labeling and greatly reduces
labeling cost, 3) the singing model based on a feed-forward Transformer is
simple and efficient, by removing the complicated acoustic feature modeling in
parametric synthesis and leveraging a reference encoder to capture the timbre
of a singer from noisy singing data, and 4) it can synthesize singing voices in
multiple languages and multiple singers. We evaluate DeepSinger on our mined
singing dataset, which consists of about 92 hours of data from 89 singers in three
languages (Chinese, Cantonese and English). The results demonstrate that with
the singing data purely mined from the Web, DeepSinger can synthesize
high-quality singing voices in terms of both pitch accuracy and voice
naturalness (footnote: Our audio samples are shown in
https://speechresearch.github.io/deepsinger/.)
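The abstract states that DeepSinger synthesizes waveforms from predicted linear spectrograms with the Griffin-Lim algorithm. As an illustration of that final step, here is a minimal NumPy/SciPy sketch of Griffin-Lim phase reconstruction; the STFT parameters and iteration count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Estimate a phase consistent with `magnitude` by alternating
    inverse/forward STFTs, then return the reconstructed waveform."""
    rng = np.random.default_rng(seed)
    # Start from random phase; iterations gradually make it consistent.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current complex spectrogram to a waveform ...
        _, wav = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
        # ... re-analyse it, and keep only the phase of the result.
        _, _, spec = stft(wav, nperseg=nperseg, noverlap=noverlap)
        spec = spec[:, :magnitude.shape[1]]
        if spec.shape[1] < magnitude.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, magnitude.shape[1] - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec))
    _, wav = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
    return wav
```

Each iteration enforces the target magnitude while projecting onto the set of spectrograms realizable by an actual waveform, which is why the phase estimate improves monotonically in practice.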
Related papers
- GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks [52.30565320125514]
GTSinger is a large, global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores.
We collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset.
We conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion.
arXiv Detail & Related papers (2024-09-20T18:18:14Z)
- MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance [14.22941848955693]
MakeSinger is a semi-supervised training method for singing voice synthesis.
Our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step.
We demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.
arXiv Detail & Related papers (2024-06-10T01:47:52Z)
- Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment [56.019288564115136]
We propose a novel task called text-to-song synthesis, which incorporates the generation of both vocals and accompaniment.
We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis.
Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency.
arXiv Detail & Related papers (2024-04-14T18:00:05Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - BiSinger: Bilingual Singing Voice Synthesis [9.600465391545477]
This paper presents BiSinger, a bilingual pop SVS system for English and Mandarin Chinese.
We design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules.
Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS.
arXiv Detail & Related papers (2023-09-25T12:31:05Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses [13.178747366560534]
We develop a new multi-singer Chinese neural singing voice synthesis system named WeSinger.
Quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness.
arXiv Detail & Related papers (2022-03-21T06:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.