RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
- URL: http://arxiv.org/abs/2305.10686v1
- Date: Thu, 18 May 2023 03:57:51 GMT
- Title: RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
- Authors: Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui,
Huadai Liu, Zhou Zhao
- Abstract summary: RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
- Score: 56.51475521778443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in a challenging task, Realistic-Music-Score based Singing
Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices
given realistic music scores with different note types (grace, slur, rest,
etc.). Though significant progress has been achieved, recent singing voice
synthesis (SVS) methods are limited to fine-grained music scores, which require
a complicated data collection pipeline with time-consuming manual annotation to
align music notes with phonemes. Furthermore, this manual annotation destroys
the regularity of note durations in music scores, making fine-grained music
scores inconvenient for composing. To tackle these challenges, we propose
RMSSinger, the first RMS-SVS method, which takes realistic music scores as
input, eliminating most of the tedious manual annotation and avoiding the
aforementioned inconvenience. Since music scores are based on words rather
than phonemes, in RMSSinger we introduce word-level modeling to avoid the
time-consuming phoneme duration annotation and the complicated phoneme-level
mel-note alignment. Furthermore, we propose the first diffusion-based pitch
modeling method, which improves pitch naturalness over existing pitch-modeling
methods. To this end, we collect a new dataset containing realistic music
scores and singing voices performed by professional singers according to these
scores. Extensive experiments on the dataset demonstrate the
effectiveness of our methods. Audio samples are available at
https://rmssinger.github.io/.
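The diffusion-based pitch modeling described above can be pictured as a standard DDPM applied to F0 contours rather than mel-spectrograms. Below is a minimal sketch under that assumption; the denoiser architecture, conditioning features, and noise schedule are illustrative guesses, not details from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100                                    # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.06, T)      # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class F0Denoiser(nn.Module):
    """Hypothetical denoiser: predicts the noise added to an F0 contour,
    conditioned on frame-aligned score features (e.g., note-pitch embeddings)."""
    def __init__(self, cond_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, x_t, cond, t):
        # x_t: (B, 1, L) noisy F0; cond: (B, cond_dim, L); t: (B,) timesteps
        t_emb = (t.float() / T).view(-1, 1, 1).expand(-1, 1, x_t.size(-1))
        return self.net(torch.cat([x_t, cond, t_emb], dim=1))

def pitch_diffusion_step(model, f0, cond):
    """One DDPM training step: noise the clean F0 at a random timestep,
    then train the model to predict that noise (epsilon objective)."""
    t = torch.randint(0, T, (f0.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(f0)
    x_t = a_bar.sqrt() * f0 + (1.0 - a_bar).sqrt() * noise   # forward process
    return F.mse_loss(model(x_t, cond, t), noise)

# Toy usage: 2 utterances, 400 frames each.
model = F0Denoiser()
loss = pitch_diffusion_step(model, torch.randn(2, 1, 400), torch.randn(2, 256, 400))
loss.backward()
```

At inference time, the same schedule would be run in reverse to sample an F0 contour from noise, conditioned on the score.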
Related papers
- Automatic Estimation of Singing Voice Musical Dynamics [9.343063100314687]
We propose a methodology for dataset curation.
We compile a dataset of 509 singing voice performances annotated with musical dynamics, aligned with 163 score files.
We train a CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics.
We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction.
(arXiv, 2024-10-27)
- Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves.
We propose an end-to-end system based on graph neural networks that clusters notes belonging to the same chord and connects notes with edges if they are part of the same voice.
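As a rough illustration of that graph formulation (the node and edge definitions below are assumptions, not the paper's exact construction), notes can be treated as nodes, with chord edges between simultaneous onsets and candidate same-voice edges between temporally adjacent notes:

```python
def build_note_graph(notes):
    """Hypothetical graph construction for GNN-based voice separation.
    notes: list of (onset, offset, pitch) tuples from a quantized piece
           (e.g., parsed from a MIDI file), sorted by onset.
    Returns chord edges (simultaneous onsets) and candidate same-voice
    edges (temporally adjacent pairs) for a GNN to classify."""
    chord_edges, voice_candidates = [], []
    for i, (on_i, off_i, _) in enumerate(notes):
        for j in range(i + 1, len(notes)):
            on_j, _, _ = notes[j]
            if on_j == on_i:               # simultaneous onset -> same chord
                chord_edges.append((i, j))
            elif on_j >= off_i:            # j starts after i ends -> candidate
                voice_candidates.append((i, j))
    return chord_edges, voice_candidates

# Toy usage: two chords of two notes each.
notes = [(0, 1, 60), (0, 1, 64), (1, 2, 62), (1, 2, 65)]
chords, candidates = build_note_graph(notes)
# chords == [(0, 1), (2, 3)]; candidates pair notes 0/1 with notes 2/3
```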
(arXiv, 2024-07-15)
- Accompanied Singing Voice Synthesis with Fully Text-controlled Melody [61.147446955297625]
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices.
We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
(arXiv, 2024-07-02)
- End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
We propose a two-stage training scheme: the model is first pre-trained on synthetic audio generated by an expressive performance rendering system, then fine-tuned on recordings of human performances.
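A minimal sketch of such a two-stage scheme, with a placeholder model and data loaders rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def run_epoch(model, loader, optimizer):
    """One teacher-forced training epoch for a Seq2Seq audio-to-score model."""
    for audio, score_tokens in loader:
        optimizer.zero_grad()
        logits = model(audio, score_tokens[:, :-1])      # predict next token
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               score_tokens[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()

def two_stage_training(model, synthetic_loader, human_loader):
    # Stage 1: pre-train on synthetic audio rendered from scores.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(10):                  # epoch counts and LRs are arbitrary
        run_epoch(model, synthetic_loader, opt)
    # Stage 2: fine-tune the same weights on human recordings, smaller LR.
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(3):
        run_epoch(model, human_loader, opt)
    return model
```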
(arXiv, 2024-05-22)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
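A differentiable duration regulator is often realized as Gaussian upsampling, where gradients flow through soft, duration-derived alignment weights; the sketch below shows that generic mechanism, not necessarily this paper's design:

```python
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """Generic differentiable length regulator (Gaussian upsampling sketch,
    in the style of Non-Attentive Tacotron; not this paper's exact design).
    h:         (B, N, D) phoneme-level hidden states
    durations: (B, N) predicted durations in frames (float, differentiable)
    Returns (B, T, D) frame-level states, T = max total duration."""
    centers = torch.cumsum(durations, dim=1) - 0.5 * durations      # (B, N)
    T = int(durations.sum(dim=1).max().item())
    t = torch.arange(T, dtype=h.dtype, device=h.device).view(1, T, 1)
    # Each output frame attends softly to phonemes whose center is nearby,
    # so gradients flow back into the predicted durations.
    w = torch.softmax(-((t - centers.unsqueeze(1)) ** 2) / (2 * sigma ** 2), dim=2)
    return w @ h

# Toy usage: 4 phonemes upsampled to 10 frames.
h = torch.randn(1, 4, 8)
dur = torch.tensor([[2.0, 3.0, 1.0, 4.0]], requires_grad=True)
frames = gaussian_upsample(h, dur)        # shape (1, 10, 8)
```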
(arXiv, 2023-09-01)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-source pre-trained models developed on music recordings as baselines.
(arXiv, 2023-06-18)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
(arXiv, 2023-05-08)
- Unaligned Supervision For Automatic Music Transcription in The Wild [1.2183405753834562]
NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy on the MAPS dataset, and large favorable margins in cross-dataset evaluations.
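A heavily simplified reading of that idea as an EM-style loop (the interfaces below are placeholders based only on the summary above, not the paper's implementation):

```python
def noteem_style_round(transcriber, audio_feats, score_notes, align):
    """Schematic EM-style round. `transcriber` exposes predict/fit;
    `align` maps (unaligned score, frame posteriors) to frame-level labels,
    e.g., via dynamic time warping over the posteriorgram.
    All of these callables are hypothetical, not a real API."""
    posteriors = transcriber.predict(audio_feats)    # current frame-wise beliefs
    pseudo_labels = align(score_notes, posteriors)   # E-step: align score to audio
    transcriber.fit(audio_feats, pseudo_labels)      # M-step: supervised update
    return transcriber
```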
(arXiv, 2022-04-28)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
(arXiv, 2022-02-12)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
(arXiv, 2021-05-06)
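The DiffSinger entry above describes the standard denoising diffusion formulation: a learned denoiser is applied iteratively to Gaussian noise, conditioned on the music score. A generic ancestral-sampling sketch of that process (the schedule, step count, and tensor shapes are assumptions):

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, score_cond, shape, T=100):
    """Generic DDPM ancestral sampling: start from Gaussian noise and
    iteratively denoise into a mel-spectrogram, conditioned on score
    features. `denoiser` is a trained noise-prediction network."""
    betas = torch.linspace(1e-4, 0.06, T)
    alphas = 1.0 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                             # pure noise x_T
    for t in reversed(range(T)):
        eps = denoiser(x, score_cond, torch.full((shape[0],), t))
        # Posterior mean: strip the predicted noise component at step t.
        x = (x - betas[t] / (1.0 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # stochastic term
    return x

# Usage sketch: mel = ddpm_sample(trained_model, cond, shape=(1, 80, 400))
```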
This list is automatically generated from the titles and abstracts of the papers on this site.