Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System
- URL: http://arxiv.org/abs/2108.02776v1
- Date: Thu, 5 Aug 2021 17:59:58 GMT
- Title: Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System
- Authors: Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
- Abstract summary: This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system.
The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder.
Experimental results show our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch.
- Score: 25.573552964889963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Sinsy, a deep neural network (DNN)-based singing voice
synthesis (SVS) system. In recent years, DNNs have been utilized in statistical
parametric SVS systems, and DNN-based SVS systems have demonstrated better
performance than conventional hidden Markov model-based ones. SVS systems are
required to synthesize a singing voice with pitch and timing that strictly
follow a given musical score. Additionally, singing expressions that are not
described on the musical score, such as vibrato and timing fluctuations, should
be reproduced. The proposed system is composed of four modules: a time-lag
model, a duration model, an acoustic model, and a vocoder, so that singing
voices can be synthesized with these characteristics taken into account.
To better model a singing voice, the proposed system incorporates improved
approaches to modeling pitch and vibrato and better training criteria into the
acoustic model. In addition, we incorporate PeriodNet, a non-autoregressive
neural vocoder with robustness to pitch, into our system to generate a
high-fidelity singing voice waveform. Moreover, we propose automatic pitch
correction techniques for DNN-based SVS to synthesize singing voices with
correct pitch even if the training data has out-of-tune phrases. Experimental
results show our system can synthesize a singing voice with better timing, more
natural vibrato, and correct pitch, and it can achieve better mean opinion
scores in subjective evaluation tests.
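To make the four-module flow above concrete, the sketch below wires a toy version of such a pipeline together in Python. Everything in it is an illustrative assumption rather than the paper's actual models: the time-lag and duration modules are trivial placeholders for trained DNNs, the acoustic model simply adds a fixed sinusoidal vibrato to the score pitch, the pitch-correction helper shifts each note's median F0 onto the score note, and the vocoder is a sine-wave stand-in for PeriodNet.

```python
# Toy sketch of a Sinsy-style four-module SVS pipeline
# (time-lag model -> duration model -> acoustic model -> vocoder).
# All module bodies are illustrative placeholders, not the paper's models.
import numpy as np

FRAME_SHIFT_SEC = 0.005  # assumed 5 ms frame shift


def midi_to_hz(midi):
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def time_lag_model(notes):
    """Predict how far each note onset deviates from the score (seconds).
    A real system would run a trained DNN on musical-score features."""
    return np.zeros(len(notes))  # placeholder: no deviation


def duration_model(notes, time_lags):
    """Return the number of frames each note occupies after applying the
    predicted time lags (the real model works at the phoneme level)."""
    return [max(1, int(round((note["length_sec"] + lag) / FRAME_SHIFT_SEC)))
            for note, lag in zip(notes, time_lags)]


def acoustic_model(notes, note_frames, vibrato_hz=5.5, vibrato_cents=30.0):
    """Predict frame-level log-F0: score pitch plus a fixed sinusoidal
    vibrato, standing in for the paper's learned pitch/vibrato modeling."""
    segments = []
    for note, n in zip(notes, note_frames):
        t = np.arange(n) * FRAME_SHIFT_SEC
        cents = vibrato_cents * np.sin(2.0 * np.pi * vibrato_hz * t)
        segments.append(np.log(midi_to_hz(note["midi"])) + np.log(2.0) * cents / 1200.0)
    return np.concatenate(segments)


def pitch_correct(log_f0, notes, note_frames):
    """Toy 'automatic pitch correction': shift each note segment so that its
    median log-F0 matches the score note (an assumption, not the paper's method)."""
    out, start = log_f0.copy(), 0
    for note, n in zip(notes, note_frames):
        seg = slice(start, start + n)
        out[seg] += np.log(midi_to_hz(note["midi"])) - np.median(log_f0[seg])
        start += n
    return out


def vocoder(log_f0):
    """Stand-in for PeriodNet: a sinusoid following the predicted F0,
    generating one sample per frame for illustration only."""
    phase = np.cumsum(2.0 * np.pi * np.exp(log_f0) * FRAME_SHIFT_SEC)
    return np.sin(phase)


if __name__ == "__main__":
    score = [{"midi": 67, "length_sec": 0.5}, {"midi": 69, "length_sec": 0.5}]
    lags = time_lag_model(score)
    frames = duration_model(score, lags)
    lf0 = pitch_correct(acoustic_model(score, frames), score, frames)
    print("synthesized", vocoder(lf0).shape[0], "samples")
```

Running the script prints the number of frames synthesized for the two-note toy score; in the real system each placeholder would be replaced by the corresponding trained DNN, and PeriodNet would generate waveform samples at the audio rate.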
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute control over singer gender, vocal range, and volume via natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Our evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system can produce higher-quality singing voices, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, the proposed model is extended to HiddenSinger-U, an unsupervised singing voice learning framework.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit [30.894603855905828]
NNSVS is an open-source software toolkit for neural network-based singing voice synthesis research.
It is inspired by Sinsy, an open-source pioneer in singing voice synthesis research.
arXiv Detail & Related papers (2022-10-28T08:37:13Z)
- WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses [13.178747366560534]
We develop a new multi-singer Chinese neural singing voice synthesis system named WeSinger.
Quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness.
arXiv Detail & Related papers (2022-03-21T06:42:44Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task: singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (a minimal sketch of this kind of denoising loop appears after this list).
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer [11.598416444452619]
We design a multi-singer framework to leverage the existing singing data of different singers.
We incorporate an adversarial singer-classification task to make the encoder output less singer-dependent.
The proposed synthesizer can generate higher-quality singing voices than the baseline.
arXiv Detail & Related papers (2020-06-18T07:20:11Z)
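As noted in the DiffSinger entry above, its acoustic model is an iterative denoising (diffusion) chain conditioned on the musical score. The snippet below is a generic DDPM-style sampling loop of that kind; the step count, noise schedule, mel dimensions, and the zero-returning placeholder denoiser are assumptions made for illustration and do not reflect DiffSinger's actual configuration.

```python
# Generic DDPM-style reverse diffusion that turns Gaussian noise into a
# mel-spectrogram conditioned on score features (illustrative sketch only).
import numpy as np

T = 50                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.06, T)       # noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def denoiser(x_t, t, score_condition):
    """Placeholder for the trained network that predicts the noise added at
    step t, conditioned on music-score features; here it returns zeros."""
    return np.zeros_like(x_t)


def reverse_diffusion(score_condition, n_frames=100, n_mels=80, seed=0):
    """Iteratively denoise x_T ~ N(0, I) into a mel-spectrogram using the
    standard DDPM sampling recursion."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_frames, n_mels))
    for t in reversed(range(T)):
        eps = denoiser(x, t, score_condition)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


mel = reverse_diffusion(score_condition=None)
print(mel.shape)  # (100, 80)
```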
This list is automatically generated from the titles and abstracts of the papers in this site.