Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
- URL: http://arxiv.org/abs/2404.09313v3
- Date: Mon, 20 May 2024 05:50:36 GMT
- Title: Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
- Authors: Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang
- Abstract summary: We propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation.
We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis.
Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency.
- Score: 56.019288564115136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
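The tri-tower contrastive pretraining is the abstract's central technical idea. The sketch below shows one plausible reading of it, assuming a symmetric InfoNCE loss applied pairwise across text, vocal, and accompaniment encoders; the encoder backbones, feature dimensions, and equal loss weighting are illustrative placeholders, not Melodist's actual implementation.

```python
# A minimal sketch of tri-tower contrastive pretraining: align text, vocal,
# and accompaniment embeddings with pairwise symmetric InfoNCE losses.
# Encoders and dims are stand-ins, not the paper's architecture.
import torch
import torch.nn.functional as F
from torch import nn

class Tower(nn.Module):
    """Placeholder encoder: projects precomputed features into a shared space."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.GELU(),
                                  nn.Linear(emb_dim, emb_dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)  # unit-norm embeddings

def info_nce(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    logits = a @ b.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(a.size(0))      # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

text_tower, vocal_tower, accomp_tower = Tower(512), Tower(128), Tower(128)
# Dummy batch: 8 aligned (text, vocal, accompaniment) feature triples.
t = text_tower(torch.randn(8, 512))
v = vocal_tower(torch.randn(8, 128))
m = accomp_tower(torch.randn(8, 128))
# Tri-tower objective: pull all three views of the same song together.
loss = info_nce(t, v) + info_nce(t, m) + info_nce(v, m)
loss.backward()
```

Coupling all three towers pairwise makes the text embedding predictive of both the vocal and the accompaniment, which is what a controllable V2A conditioning signal requires.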
Related papers
- Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations [5.56093728482997]
We propose a straightforward method that enables control over the accompaniment through text prompts.
Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.
arXiv Detail & Related papers (2024-11-03T19:17:20Z)
- Constructing a Singing Style Caption Dataset [12.515874333424929]
We introduce S2Cap, an audio-text pair dataset with a diverse set of attributes.
S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes.
We present a novel mechanism called CRESCENDO, which utilizes positive-pair similarity learning to synchronize the embedding space of a pretrained audio encoder with that of a text encoder.
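The summary does not spell out CRESCENDO's details; below is a hedged sketch of generic positive-pair similarity learning under the assumption that a trainable text encoder is pulled toward a frozen, pretrained audio encoder. All module names and dimensions are hypothetical.

```python
# Positive-pair similarity learning (illustrative, not S2Cap's actual code):
# a trainable text projection is aligned to a frozen audio embedding space.
import torch
import torch.nn.functional as F
from torch import nn

audio_encoder = nn.Linear(128, 256)          # stand-in for a pretrained encoder
for p in audio_encoder.parameters():
    p.requires_grad = False                  # keep the audio space fixed

text_encoder = nn.Sequential(nn.Linear(300, 256), nn.GELU(), nn.Linear(256, 256))

audio_feats = torch.randn(16, 128)           # dummy audio features
text_feats = torch.randn(16, 300)            # dummy paired caption features

za = F.normalize(audio_encoder(audio_feats), dim=-1)
zt = F.normalize(text_encoder(text_feats), dim=-1)

# Maximize cosine similarity of each (audio, text) positive pair,
# i.e. minimize 1 - cos(za, zt). No negatives are used in this sketch.
loss = (1 - (za * zt).sum(dim=-1)).mean()
loss.backward()
```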
arXiv Detail & Related papers (2024-09-15T21:19:24Z)
- SongCreator: Lyrics-based Universal Song Generation [53.248473603201916]
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics.
The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for DSLM.
Experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.
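The mask strategies themselves are not specified in this summary. The following is a minimal sketch of one plausible dual-sequence masking scheme, in which each token stream attends causally to itself and a flag controls whether accompaniment tokens may also attend to the vocals; it is an illustration, not SongCreator's actual DSLM masks.

```python
# One plausible dual-stream attention mask: vocal and accompaniment tokens
# are concatenated, each stream is causal over itself, and the accompaniment
# may optionally condition on the full vocal context.
import torch

def dual_stream_mask(n_voc: int, n_acc: int, acc_sees_voc: bool = True):
    """Boolean mask where True means 'query row may attend to key column'."""
    n = n_voc + n_acc
    allow = torch.zeros(n, n, dtype=torch.bool)
    # Causal self-attention within the vocal stream.
    allow[:n_voc, :n_voc] = torch.tril(torch.ones(n_voc, n_voc, dtype=torch.bool))
    # Causal self-attention within the accompaniment stream.
    allow[n_voc:, n_voc:] = torch.tril(torch.ones(n_acc, n_acc, dtype=torch.bool))
    if acc_sees_voc:
        # Accompaniment tokens may condition on all vocal tokens.
        allow[n_voc:, :n_voc] = True
    return allow

print(dual_stream_mask(3, 3).int())  # visualize who may attend to whom
```

Toggling flags like `acc_sees_voc` per training example is one simple way a single model could cover several vocal/accompaniment generation tasks.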
arXiv Detail & Related papers (2024-09-09T19:37:07Z)
- Accompanied Singing Voice Synthesis with Fully Text-controlled Melody [61.147446955297625]
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices.
We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
arXiv Detail & Related papers (2024-07-02T08:23:38Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
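One simple way to realize a range-melody decoupled pitch representation is to split a log-F0 contour into a coarse register token plus a mean-removed, range-invariant contour; the sketch below follows that assumption. The quantization grid and function names are hypothetical, not Prompt-Singer's actual design.

```python
# Decouple "range" (register) from "melody" (contour) in an F0 sequence.
# Illustrative only: the bin edges and decomposition are assumptions.
import numpy as np

def decouple_pitch(f0_hz: np.ndarray, n_range_bins: int = 16):
    log_f0 = np.log2(f0_hz)            # pitch on a log scale (octaves)
    register = log_f0.mean()           # coarse vocal register of the singer
    contour = log_f0 - register        # range-invariant melody shape
    # Quantize the register into a discrete range token, e.g. over C2..C6.
    bins = np.linspace(np.log2(65.4), np.log2(1046.5), n_range_bins)
    range_token = int(np.digitize(register, bins))
    return range_token, contour

f0 = np.array([220.0, 247.0, 262.0, 294.0, 330.0])  # short dummy F0 contour
token, contour = decouple_pitch(f0)
print(token, contour.round(3))
```

Conditioning the melody model on the contour while letting the text prompt set the range token is one way such a split could make vocal range controllable.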
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
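To make "compiling the melody into decoding constraints" concrete, the toy sketch below turns each melodic phrase's note count into a syllable budget and masks out candidate words that would overshoot it during generation. The vocabulary and the random stand-in for a language model are assumptions, not the paper's method.

```python
# Toy constrained decoding: melody segmentation -> per-line syllable budgets.
import random

phrase_note_counts = [4, 6]   # phrase lengths taken from melody segmentation
vocab = {"love": 1, "tonight": 2, "forever": 3, "you": 1, "shining": 2}

def generate_line(budget: int, rng: random.Random) -> str:
    words, used = [], 0
    while used < budget:
        # Constraint: only allow words whose syllables still fit the budget.
        candidates = [w for w, s in vocab.items() if used + s <= budget]
        word = rng.choice(candidates)  # stand-in for a language model's pick
        words.append(word)
        used += vocab[word]
    return " ".join(words)

rng = random.Random(0)
for notes in phrase_note_counts:
    print(f"[{notes} notes] {generate_line(notes, rng)}")
```

In a real system the same budget check would prune the language model's top-k candidates at each step rather than sampling uniformly.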
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
- SingSong: Generating musical accompaniments from singing [35.819589427197464]
We present SingSong, a system that generates instrumental music to accompany input vocals.
In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong.
arXiv Detail & Related papers (2023-01-30T04:53:23Z)
- DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.