Related papers: Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

URL: http://arxiv.org/abs/2407.19900v1
Date: Mon, 29 Jul 2024 11:24:10 GMT
Title: Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
Authors: Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee,
Abstract summary: Music generation introduces challenging complexities to large language models. Existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data. We develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations.
Score: 28.685224087199053
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

Related papers

Music Flamingo: Scaling Music Understanding in Audio Language Models [98.94537017112704]
Music Flamingo is a novel large audio-language model designed to advance music understanding in foundational audio models.<n> MF-Skills is a dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context.<n>We introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards.
arXiv Detail & Related papers (2025-11-13T13:21:09Z)
YuE: Scaling Open Foundation Models for Long-Form Music Generation [134.54174498094565]
YuE is a family of open foundation models based on the LLaMA2 architecture. It generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment.
arXiv Detail & Related papers (2025-03-11T17:26:50Z)
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation [5.201151187019607]
PerceiverS (Segmentation and Scale) is a novel architecture designed to generate long-structured and expressive music. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music.
arXiv Detail & Related papers (2024-11-13T03:14:10Z)
UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations [0.3683202928838613]
Cadenza is a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas. The proposed framework comprises of two sequential stages: 1) Composer and 2) Performer. Our framework is designed, researched and implemented with the objective of providing inspiration for musicians.
arXiv Detail & Related papers (2024-10-02T22:11:31Z)
MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music. To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation) Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
Structure-informed Positional Encoding for Music Generation [0.0]
We propose a structure-informed positional encoding framework for music generation with Transformers. We test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. Our methods improve the melodic and structural consistency of the generated pieces.
arXiv Detail & Related papers (2024-02-20T13:41:35Z)
Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, as one of the pioneers in VAE methods that effectively model and generate long multi-track symbolic music. We focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy.
arXiv Detail & Related papers (2024-01-15T08:41:01Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
The Music Annotation Pattern [1.2043574473965315]
We introduce the Music Pattern, an Ontology Design Pattern (ODP) to homogenise different annotation systems and to represent several types of musical objects. Our ODP accounts for multi-modality upfront, to describe annotations derived from different sources, and it is the first to enable the integration of music datasets at a large scale.
arXiv Detail & Related papers (2023-03-30T11:13:59Z)
The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model. We propose a multi-scale Transformer, which uses coarse-decoder and fine-decoders to model the contexts at the global and section-level. Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
arXiv Detail & Related papers (2022-05-17T18:48:14Z)
Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model. A novel transformer decoder architecture is introduced as backbone for modeling extra-long sequences of symphony tokens. Our empirical results show that our proposed approach can generate coherent, novel, complex and harmonious symphony compared to human composition.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
Learning Interpretable Representation for Controllable Polyphonic Music Generation [5.01266258109807]
We design a novel architecture, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications.
arXiv Detail & Related papers (2020-08-17T07:11:16Z)
SongNet: Rigid Formats Controlled Text Generation [51.428634666559724]
We propose a simple and elegant framework named SongNet to tackle this problem. The backbone of the framework is a Transformer-based auto-regressive language model. A pre-training and fine-tuning framework is designed to further improve the generation quality.
arXiv Detail & Related papers (2020-04-17T01:40:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.