Melody Is All You Need For Music Generation
- URL: http://arxiv.org/abs/2409.20196v2
- Date: Thu, 3 Oct 2024 07:39:20 GMT
- Title: Melody Is All You Need For Music Generation
- Authors: Shaopeng Wei, Manzhen Wei, Haoyu Wang, Yu Zhao, Gang Kou
- Abstract summary: We present the Melody Guided Music Generation (MMGen) model, the first approach to use melody to guide music generation.
Specifically, we first align the melody with audio waveforms and their associated descriptions using the multimodal alignment module.
This allows MMGen to generate music that matches the style of the provided audio while also producing music that reflects the content of the given text description.
- Score: 10.366088659024685
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Melody Guided Music Generation (MMGen) model, the first approach to use melody to guide music generation; despite a simple method and extremely limited resources, it achieves excellent performance. Specifically, we first align the melody with audio waveforms and their associated descriptions using a multimodal alignment module. Subsequently, we condition the diffusion module on the learned melody representations. This allows MMGen to generate music that matches the style of the provided audio while also producing music that reflects the content of the given text description. To address the scarcity of high-quality data, we construct a multi-modal dataset, MusicSet, which includes melody, text, and audio, and will be made publicly available. Extensive experiments demonstrate the superiority of the proposed model in terms of both quantitative metrics and perceived quality.
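The abstract describes a two-stage pipeline: a multimodal alignment module embeds melody alongside audio and text, and a diffusion module is then conditioned on the learned melody representation. As a rough illustration of that conditioning pattern, here is a minimal PyTorch sketch; the module names, the FiLM-style conditioning, and all dimensions are assumptions, not MMGen's published architecture.

```python
# Minimal sketch of melody-conditioned diffusion denoising (assumed
# modules and sizes; FiLM conditioning is an illustrative choice).
import torch
import torch.nn as nn

class MelodyEncoder(nn.Module):
    """Maps a melody feature sequence to a single conditioning vector."""
    def __init__(self, n_feats=64, d_cond=256):
        super().__init__()
        self.rnn = nn.GRU(n_feats, d_cond, batch_first=True)

    def forward(self, melody):               # (B, T, n_feats)
        _, h = self.rnn(melody)              # h: (1, B, d_cond)
        return h.squeeze(0)                  # (B, d_cond)

class ConditionedDenoiser(nn.Module):
    """Predicts noise from a noisy audio latent, a timestep, and a
    melody embedding, via FiLM-style scale-and-shift modulation."""
    def __init__(self, d_latent=128, d_cond=256):
        super().__init__()
        self.time_emb = nn.Embedding(1000, d_cond)
        self.film = nn.Linear(d_cond, 2 * d_latent)    # scale and shift
        self.net = nn.Sequential(
            nn.Linear(d_latent, 512), nn.SiLU(), nn.Linear(512, d_latent))

    def forward(self, x_t, t, melody_emb):
        cond = melody_emb + self.time_emb(t)           # (B, d_cond)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        return self.net(x_t * (1 + scale) + shift)     # predicted noise

# One training step on random data, standard epsilon-prediction loss.
enc, denoiser = MelodyEncoder(), ConditionedDenoiser()
melody = torch.randn(4, 100, 64)      # batch of melody feature sequences
x0 = torch.randn(4, 128)              # clean audio latents
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(x0)
x_t = x0 + noise                      # simplified forward process
loss = nn.functional.mse_loss(denoiser(x_t, t, enc(melody)), noise)
```

The single pooled melody vector stands in for whatever representation the alignment module learns; a real system would more likely condition on the full embedding sequence via cross-attention.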
Related papers
- Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment [6.806050368211496]
We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time.
Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption.
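Since the summary describes steering a frozen generator with alignment rewards at inference time rather than retraining, a generic best-of-N rescoring loop captures the idea. Everything below is a hypothetical stand-in: the generator and both reward functions are placeholders, not the paper's actual models.

```python
# Generic best-of-N inference-time alignment: sample candidates from a
# frozen generator and keep the one with the highest combined reward.
import random

def generate_candidate(caption: str, seed: int) -> list[int]:
    """Stand-in for a symbolic music generator; returns a token sequence."""
    rng = random.Random(hash(caption) ^ seed)
    return [rng.randrange(128) for _ in range(64)]

def text_alignment_reward(tokens: list[int], caption: str) -> float:
    """Placeholder for a learned text-audio alignment score."""
    return -abs(sum(tokens) / len(tokens) - len(caption))

def structure_reward(tokens: list[int]) -> float:
    """Placeholder: reward repeated 8-token phrases as a structure proxy."""
    phrases = [tuple(tokens[i:i + 8]) for i in range(0, len(tokens), 8)]
    return len(phrases) - len(set(phrases))

def best_of_n(caption: str, n: int = 16, w: float = 0.5) -> list[int]:
    candidates = [generate_candidate(caption, s) for s in range(n)]
    return max(candidates, key=lambda c: w * text_alignment_reward(c, caption)
                                         + (1 - w) * structure_reward(c))

melody = best_of_n("a gentle piano waltz")
```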
arXiv Detail & Related papers (2025-05-19T03:36:06Z)
- SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training [7.3026780262967685]
SongGLM is a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training.
We construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning.
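One plausible reading of the "2D alignment encoding" is that every token carries two indices, the lyric-melody unit it belongs to and its offset within that unit, each with its own learned embedding. The sketch below implements that reading; the unit definition and dimensions are assumptions, not SongGLM's published design.

```python
# Hedged sketch of a 2D alignment encoding: sum of a per-unit and a
# within-unit positional embedding (assumed sizes and unit definition).
import torch
import torch.nn as nn

class TwoDAlignmentEncoding(nn.Module):
    def __init__(self, max_units=256, max_offset=64, d_model=512):
        super().__init__()
        self.unit_emb = nn.Embedding(max_units, d_model)
        self.offset_emb = nn.Embedding(max_offset, d_model)

    def forward(self, unit_ids, offsets):       # both (B, T) int tensors
        return self.unit_emb(unit_ids) + self.offset_emb(offsets)

# Example: three tokens in aligned unit 0, two tokens in unit 1.
unit_ids = torch.tensor([[0, 0, 0, 1, 1]])
offsets  = torch.tensor([[0, 1, 2, 0, 1]])
pos = TwoDAlignmentEncoding()(unit_ids, offsets)   # (1, 5, 512)
```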
arXiv Detail & Related papers (2024-12-24T02:30:07Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
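A common way to combine textual and visual cues, and a plausible approximation of the setup described here, is to concatenate the two embedding streams into one conditioning sequence that the diffusion denoiser attends over. The encoders, dimensions, and fusion choice below are illustrative assumptions, not MeLFusion's actual design.

```python
# Minimal sketch of fusing text and image cues into one conditioning
# sequence for a cross-attention diffusion denoiser (assumed shapes).
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randn(2, 20, d)   # e.g. from a frozen text encoder
image_token = torch.randn(2, 1, d)    # e.g. a pooled visual embedding
cond = torch.cat([text_tokens, image_token], dim=1)   # (2, 21, d)

# Cross-attention: noisy music latents attend over the fused cues.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
latents = torch.randn(2, 50, d)
fused, _ = attn(query=latents, key=cond, value=cond)  # (2, 50, d)
```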
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition [82.38021790213752]
SongComposer is a music-specialized large language model (LLM).
It integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations.
It outperforms advanced LLMs in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation.
We will release SongCompose, a large-scale dataset for training, containing paired lyrics and melodies in Chinese and English.
arXiv Detail & Related papers (2024-02-27T16:15:28Z)
- LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation [7.102743887290909]
This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training.
After general-domain pretraining, our proposed model first acquires length awareness from a large text-only lyric corpus.
Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody.
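To make the "fine-grained format requirements" concrete, one simple form such an objective could take is a penalty on the mismatch between a lyric line's syllable count and the note count of its melody phrase. The syllable heuristic and loss form below are assumptions for illustration, not the paper's musicologically informed objective.

```python
# Hedged sketch of a format-aware auxiliary loss: penalize the gap
# between each generated lyric line's syllable count and the note
# count of the corresponding melody phrase.
def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; a real system would use a lexicon."""
    vowels, count, prev = "aeiouy", 0, False
    for ch in word.lower():
        is_v = ch in vowels
        count += is_v and not prev
        prev = is_v
    return max(count, 1)

def format_loss(lyric_lines: list[str], notes_per_phrase: list[int]) -> float:
    """Mean absolute syllable/note mismatch across aligned lines."""
    gaps = [abs(sum(count_syllables(w) for w in line.split()) - n)
            for line, n in zip(lyric_lines, notes_per_phrase)]
    return sum(gaps) / len(gaps)

loss = format_loss(["hold me close tonight", "never let go"], [5, 4])
```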
arXiv Detail & Related papers (2023-07-05T09:42:47Z)
- Controllable Lyrics-to-Melody Generation [14.15838552524433]
We propose a controllable lyrics-to-melody generation network, ConL2M, which is able to generate realistic melodies from lyrics in a user-desired musical style.
Our work contains three main novelties: 1) to model the dependencies of music attributes across multiple sequences, inter-branch memory fusion (Memofu) is proposed to enable information flow between the multi-branch stacked LSTM architecture; 2) reference style embedding (RSE) is proposed to improve the quality of generation as well as control the musical style of generated melodies; 3) sequence-level statistical loss (SeqLoss) is proposed to help the model learn the sequence-level statistical properties of melodies.
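Of the three components, SeqLoss is the easiest to illustrate: instead of matching tokens position by position, compare a sequence-level statistic of the generated and reference melodies. The pitch-class histogram used below is an assumed choice of statistic, not necessarily the one in the paper.

```python
# Hedged sketch of a sequence-level statistical loss: compare a
# distributional statistic of generated vs. reference melodies.
import torch

def pitch_class_histogram(pitches: torch.Tensor) -> torch.Tensor:
    """Normalized 12-bin histogram over pitch classes; pitches: (T,) ints."""
    hist = torch.bincount(pitches % 12, minlength=12).float()
    return hist / hist.sum()

def seq_loss(generated: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.l1_loss(
        pitch_class_histogram(generated), pitch_class_histogram(reference))

gen = torch.randint(48, 84, (64,))   # generated MIDI pitches
ref = torch.randint(48, 84, (64,))   # reference MIDI pitches
loss = seq_loss(gen, ref)
```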
arXiv Detail & Related papers (2023-06-05T06:14:08Z)
- Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
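The key mechanism is compiling the melody itself into hard decoding constraints. A minimal sketch of that idea, under the assumed rules that rests mark phrase boundaries and each phrase's note count becomes a syllable budget, looks like this:

```python
# Hedged sketch of compiling a melody into decoding constraints. Both
# rules (rests as boundaries, note counts as syllable budgets) are
# illustrative readings of the summary, not the paper's exact scheme.
REST = -1

def melody_to_constraints(notes: list[int]) -> list[int]:
    """Split on rests; return the syllable budget for each phrase."""
    budgets, run = [], 0
    for n in notes:
        if n == REST:
            if run:
                budgets.append(run)
            run = 0
        else:
            run += 1
    if run:
        budgets.append(run)
    return budgets

# The phrases [C4, D4, E4] and [G4, G4] yield a 3- then 2-syllable
# budget, which a lyric decoder would enforce during beam search.
print(melody_to_constraints([60, 62, 64, REST, 67, 67]))   # [3, 2]
```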
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
- Unsupervised Melody-Guided Lyrics Generation [84.22469652275714]
We propose to generate pleasantly listenable lyrics without training on melody-lyric aligned data.
We leverage the crucial alignments between melody and lyrics and compile the given melody into constraints to guide the generation process.
arXiv Detail & Related papers (2023-05-12T20:57:20Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
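Structurally, this is a text-conditioned diffusion cascade: a base model maps noise toward low-fidelity audio and later stages upsample it, with frozen-LM text embeddings conditioning every stage. The toy sketch below shows only that data flow; the shapes, single-linear "denoisers", and mean pooling are assumptions, and the real system uses large U-Nets with iterative sampling.

```python
# Hedged sketch of a two-stage diffusion cascade conditioned on frozen
# text embeddings (assumed shapes and modules throughout).
import torch
import torch.nn as nn

text_emb = torch.randn(1, 77, 512)   # from a frozen pretrained LM (assumed)

class Stage(nn.Module):
    """One stage: refines its input toward audio at a target length."""
    def __init__(self, in_len, out_len):
        super().__init__()
        self.proj_text = nn.Linear(512, 64)
        self.net = nn.Linear(in_len + 64, out_len)

    def forward(self, x, text_emb):
        cond = self.proj_text(text_emb.mean(dim=1))       # (B, 64)
        return self.net(torch.cat([x, cond], dim=-1))     # (B, out_len)

base = Stage(in_len=256, out_len=1024)       # noise -> low-rate audio
cascader = Stage(in_len=1024, out_len=4096)  # low-rate -> higher-rate

noise = torch.randn(1, 256)
low = base(noise, text_emb)          # both stages see the text condition
audio = cascader(low, text_emb)      # (1, 4096)
```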
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Re-creation of Creations: A New Paradigm for Lyric-to-Melody Generation [158.54649047794794]
Re-creation of Creations (ROC) is a new paradigm for lyric-to-melody generation.
ROC achieves good lyric-melody feature alignment in lyric-to-melody generation.
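The "re-creation" paradigm is retrieval-based: melody material is pre-generated, indexed, and then recombined to fit new lyrics. In the sketch below, fragments are keyed by note count and matched to lyric lines; both choices are illustrative simplifications, not ROC's actual feature matching.

```python
# Hedged sketch of a retrieval-based "re-creation" pipeline: melody
# fragments are pre-generated and indexed, then retrieved per lyric
# line and concatenated.
from collections import defaultdict

# A tiny pre-built fragment "database", keyed by fragment length.
fragment_db = defaultdict(list)
for frag in ([60, 62, 64], [64, 62, 60], [60, 64, 67, 64], [67, 65, 64, 62]):
    fragment_db[len(frag)].append(frag)

def syllables(line: str) -> int:
    return len(line.split())          # crude one-syllable-per-word proxy

def recreate(lyrics: list[str]) -> list[int]:
    """Retrieve one fragment per line whose length matches its syllables."""
    melody = []
    for line in lyrics:
        candidates = fragment_db.get(syllables(line), [])
        if candidates:
            melody.extend(candidates[0])  # a real system would rank these
    return melody

print(recreate(["hold me close", "under summer moon now"]))  # 3 + 4 notes
```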
arXiv Detail & Related papers (2022-08-11T08:44:47Z)
- TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method [92.36505210982648]
TeleMelody is a two-stage lyric-to-melody generation system with a music template.
It generates melodies with higher quality, better controllability, and less requirement on paired lyric-melody data.
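The two-stage split means lyrics are first mapped to an intermediate music template, and the template, not the raw lyrics, is then realized as a melody, which is what reduces the need for paired data. The template below carries only a note count and a cadence flag; TeleMelody's actual templates are richer (e.g., tonality and rhythm patterns), so treat this purely as the shape of the pipeline.

```python
# Hedged sketch of a template-based two-stage pipeline; the template
# contents are assumptions for illustration.

def lyrics_to_template(lines: list[str]) -> list[dict]:
    """Stage 1: derive a simple per-line template from the lyrics."""
    return [{"notes": len(line.split()), "final": i == len(lines) - 1}
            for i, line in enumerate(lines)]

def template_to_melody(template: list[dict],
                       scale=(60, 62, 64, 65, 67)) -> list[int]:
    """Stage 2: realize each entry as scale tones; cadence on the tonic."""
    melody = []
    for entry in template:
        phrase = [scale[i % len(scale)] for i in range(entry["notes"])]
        if entry["final"]:
            phrase[-1] = scale[0]        # end the last line on the tonic
        melody.extend(phrase)
    return melody

tmpl = lyrics_to_template(["hold me close tonight", "never let me go"])
print(template_to_melody(tmpl))
```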
arXiv Detail & Related papers (2021-09-20T15:19:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.