DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis
- URL: http://arxiv.org/abs/2105.02446v1
- Date: Thu, 6 May 2021 05:21:42 GMT
- Title: DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis
- Authors: Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Peng Liu, Zhou Zhao
- Abstract summary: DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
- Score: 53.19363127760314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A singing voice synthesis (SVS) system is built to synthesize high-quality and
expressive singing voices, in which the acoustic model generates the acoustic
features (e.g., mel-spectrogram) given a music score. Previous singing acoustic
models adopt a simple loss (e.g., L1 and L2) or a generative adversarial network
(GAN) to reconstruct the acoustic features, but they suffer from
over-smoothing and unstable training issues respectively, which hinder the
naturalness of synthesized singing. In this work, we propose DiffSinger, an
acoustic model for SVS based on the diffusion probabilistic model. DiffSinger
is a parameterized Markov chain which iteratively converts the noise into
mel-spectrogram conditioned on the music score. By implicitly optimizing the
variational bound, DiffSinger can be stably trained and generates realistic
outputs. To further improve the voice quality, we introduce a shallow
diffusion mechanism to make better use of the prior knowledge learned by the
simple loss. Specifically, DiffSinger starts generation at a shallow step
smaller than the total number of diffusion steps, according to the intersection
of the diffusion trajectories of the ground-truth mel-spectrogram and the one
predicted by a simple mel-spectrogram decoder. In addition, we train a boundary
prediction network to locate the intersection and determine the shallow step
adaptively. The evaluations conducted on a Chinese singing dataset
demonstrate that DiffSinger outperforms state-of-the-art SVS work by a
notable margin (a 0.11 MOS gain). Our extension experiments also demonstrate
the generalization of DiffSinger to the text-to-speech task.
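The shallow diffusion mechanism described in the abstract can be illustrated with a short sketch. The following is a minimal, hedged Python/PyTorch sketch, not the authors' released code: it assumes a DDPM-style linear noise schedule, a pre-trained simple mel-spectrogram decoder (`simple_decoder`), a conditional noise-prediction network (`denoiser`), and a shallow step `k` that the paper's boundary prediction network would supply adaptively; all names and hyperparameters here are illustrative placeholders.

```python
import torch

T = 100                                    # total diffusion steps (assumed value)
betas = torch.linspace(1e-4, 0.06, T)      # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Forward diffusion: corrupt a ground-truth or predicted mel x0 to step t."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def shallow_diffusion_infer(score_cond, simple_decoder, denoiser, k):
    """Start the reverse process at a shallow step k < T instead of pure noise.

    The coarse mel predicted by the simple (e.g., L1-trained) decoder is diffused
    forward to step k, then the learned reverse process refines it back to step 0.
    In the paper, k is located by a boundary prediction network; here it is taken
    as a given integer.
    """
    mel_coarse = simple_decoder(score_cond)                 # blurry but well-aligned prior
    x = q_sample(mel_coarse, k, torch.randn_like(mel_coarse))
    for t in reversed(range(k + 1)):
        eps = denoiser(x, torch.tensor([t]), score_cond)    # predict the injected noise
        alpha, a_bar = 1.0 - betas[t], alphas_cumprod[t]
        x = (x - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # standard DDPM sampling noise
    return x                                                # refined mel-spectrogram
```

Under these assumptions, training would follow the usual denoising objective (predicting the added noise with an L2 loss at randomly sampled steps), which is the implicit optimization of the variational bound that the abstract refers to.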
Related papers
- RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis [3.7937714754535503]
Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores.
Diffusion models have shown exceptional performance in various generative tasks such as image and video creation.
We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks.
arXiv Detail & Related papers (2024-10-29T01:01:18Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system can produce higher-quality singing voice, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses [13.178747366560534]
We develop a new multi-singer Chinese neural singing voice synthesis system named WeSinger.
Quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness.
arXiv Detail & Related papers (2022-03-21T06:42:44Z)