Related papers: Versatile Framework for Song Generation with Prompt-based Control

URL: http://arxiv.org/abs/2504.19062v2
Date: Tue, 29 Apr 2025 09:19:41 GMT
Title: Versatile Framework for Song Generation with Prompt-based Control
Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao
Abstract summary: VersBand is a framework for synthesizing high-quality, aligned songs with prompt-based control. VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms. AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system.
Score: 50.359999116420084
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises the following primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results demonstrate that VersBand outperforms baseline models across multiple song generation tasks on both objective and subjective metrics. Audio samples are available at https://aaronz345.github.io/VersBandDemo.

Related papers

LeVo: High-Quality Song Generation with Multi-Preference Alignment [49.94713419553945]
We introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeVo is capable of modeling two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics.
arXiv Detail & Related papers (2025-06-09T07:57:24Z)
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation. It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls [28.13660104055298]
CSL-L2M is a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder. REMI-Aligned is a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies. We introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE.
arXiv Detail & Related papers (2024-12-13T06:05:53Z)
SongCreator: Lyrics-based Universal Song Generation [53.248473603201916]
SongCreator is a song-generation system designed to tackle the challenge of generating songs with both vocals and accompaniment given lyrics. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for DSLM. Experiments demonstrate the effectiveness of SongCreator, which achieves state-of-the-art or competitive performance on all eight tasks.
arXiv Detail & Related papers (2024-09-09T19:37:07Z)
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features [19.284531698181116]
BandControlNet is designed to handle multiple music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features. The proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed. The subjective evaluations show that BandControlNet trained on short datasets can generate music of comparable quality to state-of-the-art models, while outperforming them significantly using longer datasets.
arXiv Detail & Related papers (2024-07-15T06:33:25Z)
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody [61.147446955297625]
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
arXiv Detail & Related papers (2024-07-02T08:23:38Z)
Content-based Controls For Music Large Language Modeling [6.17674772485321]
Coco-Mulla is a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Our approach achieves high-quality music generation with low-resource semi-supervised learning.
arXiv Detail & Related papers (2023-10-26T05:24:38Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data. We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints. Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
arXiv Detail & Related papers (2023-05-30T17:20:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.