Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces
- URL: http://arxiv.org/abs/2410.00344v3
- Date: Sat, 5 Oct 2024 19:31:33 GMT
- Title: Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces
- Authors: Lilac Atassi
- Abstract summary: This paper proposes integrating a text-to-music model with a large language model to generate music with form.
The experimental results show that the proposed method can generate 2.5-minute-long music that is highly structured, strongly organized, and cohesive.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent music generation methods based on transformers have a context window of up to a minute. The music generated by these methods is largely unstructured beyond the context window. With a longer context window, learning long-scale structures from musical data is a prohibitively challenging problem. This paper proposes integrating a text-to-music model with a large language model to generate music with form. The paper discusses the solutions to the challenges of such integration. The experimental results show that the proposed method can generate 2.5-minute-long music that is highly structured, strongly organized, and cohesive.
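A minimal sketch of the pipeline the abstract describes, assuming its high-level framing (an LLM plans the musical form, a text-to-music model renders each section, and sections are joined for cohesion). Every function name below is a hypothetical stand-in, not the authors' code:

```python
# Sketch only: LLM plans the form; a text-to-music model renders sections;
# sections are joined with crossfades. All functions are stand-ins.
import numpy as np

SR = 32000  # sample rate assumed by the stand-in renderer

def llm_plan_form(theme: str) -> list[dict]:
    """Stand-in for an LLM call that returns a section-by-section plan
    (e.g., ABA form) as text prompts with target durations in seconds."""
    return [
        {"label": "A", "prompt": f"{theme}, main theme, upbeat", "dur": 8},
        {"label": "B", "prompt": f"{theme}, contrasting quiet bridge", "dur": 8},
        {"label": "A", "prompt": f"{theme}, main theme, upbeat", "dur": 8},
    ]

def render_section(prompt: str, dur: float) -> np.ndarray:
    """Stand-in for a text-to-music model; here a sine tone whose pitch
    depends on the prompt, so repeated sections sound alike."""
    freq = 220.0 + 20.0 * (hash(prompt) % 10)
    t = np.arange(int(dur * SR)) / SR
    return 0.3 * np.sin(2 * np.pi * freq * t)

def crossfade_concat(parts: list[np.ndarray], fade: float = 1.0) -> np.ndarray:
    """Overlap consecutive sections with a linear crossfade."""
    n = int(fade * SR)
    out = parts[0].copy()
    for p in parts[1:]:
        ramp = np.linspace(0.0, 1.0, n)
        out[-n:] = out[-n:] * (1 - ramp) + p[:n] * ramp
        out = np.concatenate([out, p[n:]])
    return out

plan = llm_plan_form("lo-fi piano")
piece = crossfade_concat([render_section(s["prompt"], s["dur"]) for s in plan])
print(f"{len(plan)} sections -> {len(piece) / SR:.1f}s of audio")
```

The design point the sketch illustrates: the long-range structure lives entirely in the LLM's plan, so the text-to-music model only ever works within its short context window.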
Related papers
- CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models [51.03510073676228]
CLaMP 2 is a music information retrieval system that supports 101 languages.
By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale.
CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities.
arXiv Detail & Related papers (2024-10-17T06:43:54Z)
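A hedged sketch of the multilingual semantic search described for CLaMP 2 above (a generic shared-embedding retrieval pattern, not the released code; the random embeddings stand in for the model's encoders):

```python
# Queries and music documents are embedded into a shared space and
# ranked by cosine similarity. Embeddings here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
docs = {"track_a": rng.normal(size=dim), "track_b": rng.normal(size=dim)}

def embed_query(text: str) -> np.ndarray:
    """Stand-in for a multilingual text encoder."""
    return rng.normal(size=dim)

def search(text: str, k: int = 1) -> list[str]:
    q = embed_query(text)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(docs, key=lambda d: cos(q, docs[d]), reverse=True)[:k]

print(search("piano triste"))  # a non-English query, same embedding space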
- CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation [17.41880273107978]
We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words).
We collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds.
arXiv Detail & Related papers (2024-10-03T07:46:51Z)
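An illustrative sketch of the contrastive objective behind pretraining methods like CoLLAP (a generic CLIP-style symmetric InfoNCE loss, not the paper's exact formulation): matched audio/text pairs are pulled together, mismatched pairs pushed apart.

```python
# Generic symmetric InfoNCE over a batch of paired embeddings (batch, dim).
import numpy as np

def info_nce(audio_emb, text_emb, temp: float = 0.07) -> float:
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temp                  # pairwise similarities
    idx = np.arange(len(a))                  # i-th audio matches i-th text
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a2t = -log_p[idx, idx].mean()       # audio-to-text direction
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2a = -log_p_t[idx, idx].mean()     # text-to-audio direction
    return float((loss_a2t + loss_t2a) / 2)

rng = np.random.default_rng(1)
print(info_nce(rng.normal(size=(4, 32)), rng.normal(size=(4, 32))))
```

The long-form aspect of CoLLAP enters through the encoders producing these embeddings, not through the loss itself.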
- Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions.
Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
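A back-of-the-envelope sketch of the parameter-efficiency claim above: a frozen base weight plus a small trainable low-rank adapter, in the spirit of "8% new parameters" (generic LoRA-style arithmetic, not Instruct-MusicGen's actual architecture):

```python
# Frozen base layer plus a low-rank trainable update; only A and B train.
import numpy as np

d_in, d_out, rank = 1024, 1024, 16
W = np.random.randn(d_out, d_in)   # frozen pretrained weight
A = np.zeros((d_out, rank))        # trainable adapter factor (starts at zero)
B = np.random.randn(rank, d_in) * 0.01

def forward(x: np.ndarray) -> np.ndarray:
    return (W + A @ B) @ x         # base path + low-rank update

y = forward(np.random.randn(d_in))
new = A.size + B.size
print(f"adapter params: {new} ({100 * new / W.size:.1f}% of the base layer)")
```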
- MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models [24.582948932985726]
This paper introduces a novel approach to editing music generated by text-to-music models.
Our method transforms text editing into latent space manipulation while adding an extra constraint to enforce consistency.
Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations.
arXiv Detail & Related papers (2024-02-09T04:34:08Z)
- Musical Form Generation [0.0]
This paper introduces an approach for generating structured, arbitrarily long musical pieces.
Central to this approach is the creation of musical segments using a conditional generative model.
The generation of prompts that determine the high-level composition is distinct from the creation of finer, lower-level details.
arXiv Detail & Related papers (2023-10-30T08:02:08Z)
- Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
arXiv Detail & Related papers (2023-05-30T17:20:25Z)
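A toy sketch of compiling a melody into decoding constraints, as the melody-to-lyric entry above describes (the constraint here, one syllable per note, and the tiny vocabulary are made up for illustration; a real system would let a language model score the valid candidates):

```python
# Fill each melodic segment with words whose syllable counts sum exactly
# to the segment's note count (depth-first search over a toy vocabulary).
vocab = {"moon": 1, "light": 1, "river": 2, "wander": 2, "forever": 3}

def fill_segment(n_notes, words=tuple(vocab)):
    """Return a word list matching n_notes syllables, or None if impossible."""
    if n_notes == 0:
        return []
    for w in words:
        if vocab[w] <= n_notes:
            rest = fill_segment(n_notes - vocab[w], words)
            if rest is not None:
                return [w] + rest
    return None

melody_segments = [4, 3, 5]  # notes per phrase from the melody's segmentation
for n in melody_segments:
    print(n, "notes ->", " ".join(fill_segment(n)))
```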
- ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model built on diffusion models.
Our method incorporates free-form textual prompts as conditions to guide the waveform generation process.
We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story: they are used to generate paired text for the audio in the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
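A sketch of the conditioning pattern the Noise2Music entry describes: an LLM-derived text embedding steers a diffusion denoiser, typically combined with classifier-free guidance (a generic pattern; the function bodies below are stand-ins, not the paper's models):

```python
# Classifier-free guidance with a text-embedding condition (toy stand-ins).
import numpy as np

rng = np.random.default_rng(2)

def text_embed(prompt):
    """Stand-in for extracting an LLM embedding of the prompt."""
    return rng.normal(size=16)

def denoise(x, cond):
    """Stand-in for one diffusion denoising step, optionally conditioned."""
    drift = 0.0 if cond is None else 0.1 * cond.mean()
    return 0.9 * x + drift

def guided_step(x, prompt, scale=3.0):
    """Push the conditional prediction away from the unconditional one
    by the guidance scale (classifier-free guidance)."""
    c = denoise(x, text_embed(prompt))
    u = denoise(x, None)
    return u + scale * (c - u)

x = rng.normal(size=16)
print(guided_step(x, "upbeat 90s house, 124 bpm").round(2))
```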
- Melody Infilling with User-Provided Structural Context [37.55332319528369]
This paper proposes a novel Transformer-based model for music score infilling.
We show that the proposed model can harness the structural information effectively and generate higher-quality melodies in the pop style.
arXiv Detail & Related papers (2022-10-06T11:37:04Z)
- Let's Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
- SongNet: Rigid Formats Controlled Text Generation [51.428634666559724]
We propose a simple and elegant framework named SongNet to tackle text generation under rigid formats.
The backbone of the framework is a Transformer-based auto-regressive language model.
A pre-training and fine-tuning framework is designed to further improve the generation quality.
arXiv Detail & Related papers (2020-04-17T01:40:18Z)
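A toy sketch of SongNet-style format control (the symbols below are illustrative, not the paper's exact token scheme): the generator is fed placeholder format tokens plus rhyme and line-end markers, and must emit text that matches them.

```python
# Build a format-token sequence for a template, plus a post-hoc checker.
def format_tokens(line_lengths, rhyme_lines):
    """One placeholder 'C' per expected token; line-final positions get a
    rhyme marker 'R' when that line must rhyme, else an end marker 'E'."""
    seq = []
    for i, n in enumerate(line_lengths):
        seq += ["C"] * (n - 1)
        seq.append("R" if i in rhyme_lines else "E")
    return seq

# A 3-line template: 5, 7, 5 tokens, with lines 0 and 2 rhyming.
print(format_tokens([5, 7, 5], {0, 2}))

def violates(text_lines, line_lengths):
    """Simple check that generated lines match the template's lengths."""
    return any(len(l.split()) != n for l, n in zip(text_lines, line_lengths))

print(violates(["a b c d e", "a b c d e f g", "x y z w v"], [5, 7, 5]))
```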
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.