Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes
- URL: http://arxiv.org/abs/2505.15559v1
- Date: Wed, 21 May 2025 14:17:25 GMT
- Title: Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes
- Authors: Zixun Guo, Simon Dixon
- Abstract summary: Moonbeam is a transformer-based foundation model for symbolic music. It is pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. We open-source the code, pretrained model, and generated samples on Github.
- Score: 9.283206048560322
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative music information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose 2 finetuning architectures with full anticipatory capabilities, targeting 2 categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across 3 downstream music classification tasks on 4 datasets. Moreover, our finetuned conditional music generation model outperforms a strong transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on Github.
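The abstract does not spell out how the tokenization or Multidimensional Relative Attention (MRA) is implemented. The snippet below is only a minimal sketch, assuming that relative music information (e.g., pitch intervals and onset-time differences derived from the absolute attributes) is injected as a parameter-free additive bias on the attention scores; all function names and the exact bias form are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch (not the paper's exact method): parameter-free relative
# attention biases computed from per-token MIDI attributes such as pitch
# and onset time. No trainable parameters are introduced for the bias.
import torch
import torch.nn.functional as F

def relative_bias(attrs: torch.Tensor) -> torch.Tensor:
    """attrs: (batch, seq, n_attrs) absolute attributes, e.g. [pitch, onset].
    Returns a (batch, seq, seq) bias built from pairwise attribute differences."""
    diff = attrs.unsqueeze(2) - attrs.unsqueeze(1)   # (B, T, T, n_attrs)
    return -diff.abs().sum(dim=-1)                   # closer notes attend more

def attention_with_relative_bias(q, k, v, attrs):
    """q, k, v: (batch, seq, dim); attrs: (batch, seq, n_attrs)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores + relative_bias(attrs)           # inject relative information
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 4 notes with [MIDI pitch, onset-in-beats] attributes.
B, T, D = 1, 4, 8
q = k = v = torch.randn(B, T, D)
attrs = torch.tensor([[[60., 0.], [64., 1.], [67., 2.], [72., 3.]]])
out = attention_with_relative_bias(q, k, v, attrs)
print(out.shape)  # torch.Size([1, 4, 8])
```

The point of the sketch is that the relative bias is computed directly from the absolute note attributes already present in the token sequence, which is how a relative mechanism can avoid adding trainable parameters.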
Related papers
- LeVo: High-Quality Song Generation with Multi-Preference Alignment [49.94713419553945]
We introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeVo is capable of modeling two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics.
arXiv Detail & Related papers (2025-06-09T07:57:24Z) - InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation [43.690876909464336]
We introduce InspireMusic, a framework that integrates super-resolution and a large language model for high-fidelity long-form music generation. The unified framework generates high-fidelity music, songs, and audio by combining an autoregressive transformer with a super-resolution flow-matching model. Our model differs from previous approaches in that we utilize an audio tokenizer with one codebook that contains richer semantic information.
arXiv Detail & Related papers (2025-02-28T09:58:25Z) - Detecting Music Performance Errors with Transformers [3.6837762419929168]
Existing tools for music error detection rely on automatic alignment. There is a lack of sufficient data to train music error detection models. We present a novel data generation technique capable of creating large-scale synthetic music error datasets.
arXiv Detail & Related papers (2025-01-03T07:04:20Z) - MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z) - UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
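As an illustration of how several parallel codebook streams can be handled by a single-stage LM, below is a toy sketch of a "delay" interleaving pattern, one of the patterns discussed in the MusicGen paper; the padding token and array layout here are assumptions for illustration, not the reference implementation.

```python
# Toy sketch of a "delay" interleaving pattern over parallel codebook streams.
# Codebook k is shifted right by k steps so all codebooks can be predicted
# autoregressively by one LM, one time step per decoding step.
PAD = -1  # illustrative padding token

def delay_interleave(streams):
    """streams: list of K token lists, all of length T (one per codebook)."""
    K, T = len(streams), len(streams[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k, stream in enumerate(streams):
        for t, tok in enumerate(stream):
            out[k][t + k] = tok
    return out

# Example: 3 codebooks, 4 time steps.
streams = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
for row in delay_interleave(streams):
    print(row)
# [10, 11, 12, 13, -1, -1]
# [-1, 20, 21, 22, 23, -1]
# [-1, -1, 30, 31, 32, 33]
```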
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - MATT: A Multiple-instance Attention Mechanism for Long-tail Music Genre Classification [1.8275108630751844]
Imbalanced music genre classification is a crucial task in the Music Information Retrieval (MIR) field.
Most of the existing models are designed for class-balanced music datasets.
We propose a novel mechanism named Multi-instance Attention (MATT) to boost the performance for identifying tail classes.
arXiv Detail & Related papers (2022-09-09T03:52:44Z) - BERT-like Pre-training for Symbolic Piano Music Classification Tasks [15.02723006489356]
This article presents a benchmark study of symbolic piano music classification using the Bidirectional Encoder Representations from Transformers (BERT) approach.
We pre-train two 12-layer Transformer models using the BERT approach and fine-tune them for four downstream classification tasks.
Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.
arXiv Detail & Related papers (2021-07-12T07:03:57Z) - MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE [36.9033909878202]
Transformers and variational autoencoders (VAEs) have been extensively employed for symbolic (e.g., MIDI) domain music generation.
In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths.
Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based prior art on numerous widely-used metrics for style transfer tasks.
arXiv Detail & Related papers (2021-05-10T03:44:03Z)