ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
- URL: http://arxiv.org/abs/2410.15342v1
- Date: Sun, 20 Oct 2024 09:32:03 GMT
- Title: ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
- Authors: Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao
- Abstract summary: We propose a singing voice synthesis method based on the consistency model, ConSinger, to achieve high-fidelity singing voice synthesis with minimal steps.
Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality.
- Score: 4.319804315515349
- License:
- Abstract: A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voices from given music scores (lyrics, duration and pitch). Recently, diffusion models have performed well in this field. However, trading inference speed for high-quality sample generation limits their application scenarios. To obtain high-quality synthetic singing voices more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained with a consistency constraint, and generation quality is greatly improved at the cost of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
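The listing gives only the high-level idea of the consistency constraint. The sketch below is a rough, hypothetical illustration of how such a constraint can be imposed on a conditional mel-spectrogram network, in the spirit of consistency models; the network interface `f(x_noisy, sigma, cond)`, the noise schedule, and all names are assumptions for illustration, not details taken from ConSinger.

```python
# A minimal, hypothetical sketch of consistency-constraint training for a
# conditional mel-spectrogram network (not ConSinger's actual code).
import torch
import torch.nn.functional as F

def consistency_training_step(model, ema_model, optimizer, mel, cond, sigmas, mu=0.95):
    """One training step enforcing self-consistency across adjacent noise levels.

    model, ema_model: networks f(x_noisy, sigma, cond) -> estimate of the clean mel
    mel:    (B, T, n_mels) ground-truth mel-spectrograms
    cond:   music-score conditioning (lyrics / duration / pitch features)
    sigmas: 1-D tensor of noise levels in ascending order
    """
    B = mel.size(0)
    n = torch.randint(0, len(sigmas) - 1, (B,), device=mel.device)
    z = torch.randn_like(mel)
    s_lo = sigmas[n].view(B, 1, 1)
    s_hi = sigmas[n + 1].view(B, 1, 1)

    # Self-consistency: predictions from two adjacent noisy versions of the
    # same sample should agree; the EMA copy serves as the target network.
    pred_hi = model(mel + s_hi * z, s_hi, cond)
    with torch.no_grad():
        target = ema_model(mel + s_lo * z, s_lo, cond)

    loss = F.mse_loss(pred_hi, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Exponential moving average update of the target network.
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), ema_model.parameters()):
            p_ema.mul_(mu).add_(p, alpha=1.0 - mu)
    return loss.item()
```

In practice the distance metric, noise schedule, and target-network update all matter, and the paper's own choices may differ from this sketch.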
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables controlling attributes such as singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our zero-shot style transfer evaluations show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio
Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, the proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency
Model [41.21042900853639]
We propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step.
By generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time.
arXiv Detail & Related papers (2023-05-11T15:51:46Z) - WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses [13.178747366560534]
We develop a new multi-singer Chinese neural singing voice synthesis system named WeSinger.
Quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness.
arXiv Detail & Related papers (2022-03-21T06:42:44Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis [153.48507947322886]
HiFiSinger is an SVS system aimed at high-fidelity singing voices.
It consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder.
Experimental results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality than baseline systems.
arXiv Detail & Related papers (2020-09-03T16:31:02Z) - Adversarially Trained Multi-Singer Sequence-To-Sequence Singing
Synthesizer [11.598416444452619]
We design a multi-singer framework to leverage all the existing singing data of different singers.
We incorporate an adversarial task of singer classification to make encoder output less singer dependent.
The proposed synthesizer generates higher-quality singing voices than the baseline.
arXiv Detail & Related papers (2020-06-18T07:20:11Z)
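As referenced in the CoMoSpeech entry above, the single- or minimal-step generation that motivates both CoMoSpeech and ConSinger can be illustrated with a few-step consistency sampler. The sketch below is a hypothetical example under the same assumed network interface as the training sketch earlier; the sigma schedule and all names are illustrative, not taken from either paper.

```python
# A hypothetical few-step consistency sampler for a conditional
# mel-spectrogram network; not the actual CoMoSpeech or ConSinger code.
import torch

@torch.no_grad()
def consistency_sample(model, cond, shape, sigmas, device="cpu"):
    """Generate a mel-spectrogram in len(sigmas) network evaluations.

    model:  f(x_noisy, sigma, cond) -> estimate of the clean mel
    cond:   music-score conditioning (lyrics / duration / pitch features)
    shape:  (B, T, n_mels) of the output
    sigmas: noise levels in descending order, e.g. [80.0] for one-step
            sampling or [80.0, 10.0, 1.0] for a three-step refinement
    """
    B = shape[0]
    # Start from pure noise at the largest noise level and map it to data.
    s = torch.full((B, 1, 1), sigmas[0], device=device)
    x = torch.randn(shape, device=device) * sigmas[0]
    mel = model(x, s, cond)

    # Optional refinement: re-noise the estimate at a smaller sigma and
    # map it back to the data manifold again.
    for sigma in sigmas[1:]:
        s = torch.full((B, 1, 1), sigma, device=device)
        mel = model(mel + sigma * torch.randn_like(mel), s, cond)
    return mel
```

The trade-off described in the ConSinger abstract corresponds to how many entries `sigmas` contains: a single step maximizes speed, while a few extra steps trade a little speed for quality.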
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.