Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
- URL: http://arxiv.org/abs/2407.02049v1
- Date: Tue, 2 Jul 2024 08:23:38 GMT
- Title: Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
- Authors: Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao
- Abstract summary: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.
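The abstract outlines a two-stage pipeline: a language model first produces MIDI tokens (the explicit melody intermediate) and then vocal tokens, after which a latent diffusion model renders accompaniment against the finished vocal. Below is a minimal interface-level sketch of that flow; every class and function name is a hypothetical stand-in, not MelodyLM's actual API.

```python
# Interface-level sketch only; all names are hypothetical stand-ins,
# not MelodyLM's real components.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SongRequest:
    lyrics: str                               # minimal required input
    text_prompt: Optional[str] = None         # optional melody/style description
    reference_voice: Optional[bytes] = None   # timbre prompt
    midi: Optional[List[int]] = None          # optional direct melody control

def generate_midi(lyrics: str, prompt: Optional[str]) -> List[int]:
    """Stage 1a stub: an LM would autoregressively emit MIDI tokens here,
    conditioned on lyrics and the optional text prompt."""
    return [60, 62, 64, 65]  # placeholder note numbers

def generate_vocal_tokens(lyrics: str, midi: List[int],
                          voice: Optional[bytes]) -> List[int]:
    """Stage 1b stub: an LM would emit vocal acoustic (codec) tokens,
    conditioned on lyrics, the MIDI melody, and the reference voice."""
    return [0] * len(midi)

def synthesize_accompaniment(vocal: List[int],
                             prompt: Optional[str]) -> List[int]:
    """Stage 2 stub: a latent diffusion model would sample accompaniment,
    hybrid-conditioned on the vocal track for temporal alignment."""
    return [0] * len(vocal)

def synthesize_song(req: SongRequest):
    # MIDI is the explicit intermediate: supplied directly for full
    # control, or generated from text under minimal user input.
    midi = req.midi or generate_midi(req.lyrics, req.text_prompt)
    vocal = generate_vocal_tokens(req.lyrics, midi, req.reference_voice)
    accomp = synthesize_accompaniment(vocal, req.text_prompt)
    return vocal, accomp

print(synthesize_song(SongRequest(lyrics="one lonely star")))
```

The point of the MIDI bottleneck is that every level of user control (lyrics only, a textual prompt, or raw MIDI) enters the same pipeline through a single, well-defined interface.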
Related papers
- Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement
CoSaRef is a MIDI-to-audio synthesis method that can be developed without MIDI-audio paired datasets.
It first performs concatenative synthesis based on MIDI inputs and then refines the resulting audio into realistic tracks using a diffusion-based deep generative model trained on audio-only datasets.
arXiv Detail & Related papers (2024-10-22T08:01:40Z)
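A hedged sketch of the two-step idea summarized above, using a trivial sine-tone "sample library" and an identity stand-in for the diffusion refiner; all names are illustrative, not CoSaRef's code.

```python
import numpy as np

SR = 16000  # sample rate, Hz

def note_sample(pitch: int, dur_s: float) -> np.ndarray:
    """Stand-in for looking up a recorded one-shot sample for a note;
    here just a sine tone at the MIDI pitch."""
    freq = 440.0 * 2.0 ** ((pitch - 69) / 12.0)
    t = np.arange(int(SR * dur_s)) / SR
    return np.sin(2 * np.pi * freq * t).astype(np.float32)

def concatenative_synthesis(notes) -> np.ndarray:
    """Step 1: render MIDI (pitch, duration) events by concatenating
    samples -- no MIDI-audio paired training data required."""
    return np.concatenate([note_sample(p, d) for p, d in notes])

def diffusion_refine(draft: np.ndarray) -> np.ndarray:
    """Step 2 stub: a diffusion model trained on audio-only data would
    map the mechanical draft to realistic timbre; identity here."""
    return draft

audio = diffusion_refine(concatenative_synthesis([(60, 0.5), (64, 0.5), (67, 1.0)]))
print(audio.shape)  # (32000,)
```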
- Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
We propose a novel task, text-to-song synthesis, which incorporates both vocal and accompaniment generation.
We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis.
Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency.
arXiv Detail & Related papers (2024-04-14T18:00:05Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables controlling singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
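The "range-melody decoupled pitch representation" named above suggests splitting an F0 contour into a coarse register plus a range-invariant melody shape, so attributes like gender or vocal range can move the register without touching the melody. A toy numerical sketch under that reading; the paper's exact formulation may differ.

```python
import numpy as np

def decouple_pitch(f0_hz: np.ndarray):
    """Split F0 into (register, contour): a scalar vocal-range component
    plus a range-invariant melody shape, in semitones."""
    midi = 69.0 + 12.0 * np.log2(f0_hz / 440.0)  # Hz -> MIDI scale
    register = float(midi.mean())                # coarse range (controllable)
    contour = midi - register                    # melody shape, range-free
    return register, contour

def recouple_pitch(register: float, contour: np.ndarray) -> np.ndarray:
    """Re-attach a (possibly edited) register to the melody shape."""
    return 440.0 * 2.0 ** ((contour + register - 69.0) / 12.0)

f0 = np.array([220.0, 247.0, 262.0, 294.0])  # a short voiced contour
register, shape = decouple_pitch(f0)
# Raising the register by an octave transposes the voice range while the
# melody shape stays fixed -- the kind of edit a text attribute could drive.
print(recouple_pitch(register + 12.0, shape).round(1))  # [440. 494. 524. 588.]
```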
- Unsupervised Melody-to-Lyric Generation
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
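One concrete way to "compile a melody into decoding constraints", in the spirit of the summary above: each melodic phrase fixes a syllable budget that the lyric generator must fill exactly. A toy greedy version with a hypothetical syllable dictionary; the paper's constraint scheme is richer and LM-driven.

```python
import random

# Hypothetical syllable dictionary; a real system would use a lexicon.
VOCAB = {"love": 1, "tonight": 2, "forever": 3, "dream": 1, "away": 2}

def generate_line(syllable_budget: int) -> str:
    """Greedily pick words whose syllable counts exactly fill the budget
    imposed by one melodic phrase."""
    words, left = [], syllable_budget
    while left > 0:
        options = [w for w, s in VOCAB.items() if s <= left]
        word = random.choice(options)  # a real decoder would score with an LM
        words.append(word)
        left -= VOCAB[word]
    return " ".join(words)

# One entry per melodic phrase: the number of notes sets the budget,
# so each generated line stays singable against its phrase.
melody_phrase_lengths = [4, 3, 4]
print([generate_line(n) for n in melody_phrase_lengths])
```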
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
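The word-level modeling mentioned above removes two expensive annotations: per-phoneme durations and an explicit phoneme-to-note map. A small data-structure sketch of the difference, with illustrative values only.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Note:
    pitch: int    # MIDI number
    beats: float  # duration in beats

# Phoneme-level supervision: every phoneme carries a hand-annotated
# duration and an explicit note assignment -- costly to produce.
phoneme_level: List[Tuple[str, float, Note]] = [
    ("sh", 0.10, Note(67, 1.0)),
    ("ay", 0.40, Note(67, 1.0)),
    ("n",  0.10, Note(69, 0.5)),
]

# Word-level supervision: the score only maps whole words to their notes;
# how phonemes share time inside a word is left for the model to learn.
word_level: List[Tuple[str, List[Note]]] = [
    ("shine", [Note(67, 1.0), Note(69, 0.5)]),
]

print(len(phoneme_level), "phoneme labels vs", len(word_level), "word label")
```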
- A Melody-Unsupervision Model for Singing Voice Synthesis
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs, without temporal alignment, at training time.
We show that the proposed model can be trained on speech audio and text labels yet generate singing voices at inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z)
- TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method
TeleMelody is a two-stage lyric-to-melody generation system built around a music template.
It generates melodies with higher quality, better controllability, and a lower requirement for paired lyric-melody data.
arXiv Detail & Related papers (2021-09-20T15:19:33Z)
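A toy sketch of the template bridge named above: lyrics determine a simple rhythm/cadence template, and the template is then realized as a melody. All rules here are invented for illustration; TeleMelody's actual templates encode richer musical elements.

```python
def lyrics_to_template(lyrics: str) -> dict:
    """Stage 1 stub: derive a template from lyric structure alone --
    here, one beat per word plus a phrase-final cadence marker."""
    n_words = len(lyrics.split())
    return {"beats": n_words, "cadence_at": n_words - 1}

def template_to_melody(template: dict, tonic: int = 60) -> list:
    """Stage 2 stub: realize the template as MIDI pitches in a major
    scale, resolving to the tonic at the cadence position."""
    scale = [0, 2, 4, 5, 7, 9, 11]
    melody = [tonic + scale[i % len(scale)] for i in range(template["beats"])]
    melody[template["cadence_at"]] = tonic  # cadence resolves home
    return melody

print(template_to_melody(lyrics_to_template("shine on my heart tonight")))
# [60, 62, 64, 65, 60]
```

Because such a template is simple enough to extract with rules, each stage needs far less paired supervision, which is what the reduced requirement for paired lyric-melody data refers to.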