MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
- URL: http://arxiv.org/abs/2511.03942v1
- Date: Thu, 06 Nov 2025 00:40:07 GMT
- Title: MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
- Authors: Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang
- Abstract summary: We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities.
- Score: 38.07213913075033
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present MIDI-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the original LLM's parameter structure, we can directly leverage the vLLM library for accelerated inference. Experiments show that MIDI-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live demo at https://midi-llm-demo.vercel.app.
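The vocabulary-expansion step can be pictured with a minimal sketch, assuming a Hugging Face transformers stack; the base checkpoint ("gpt2") and the MIDI token names below are illustrative stand-ins, not the paper's actual choices:

```python
# Minimal sketch of expanding a text LLM's vocabulary with MIDI tokens.
# The token format and base model below are hypothetical stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical MIDI event vocabulary: pitches, durations, instruments.
midi_tokens = (
    [f"<pitch_{p}>" for p in range(128)]
    + [f"<dur_{d}>" for d in range(1, 65)]
    + [f"<inst_{i}>" for i in range(128)]
)
num_added = tokenizer.add_tokens(midi_tokens)

# Grow the embedding matrix so the new token IDs get trainable rows.
# Because only the embedding/output matrices change shape and the rest of
# the parameter structure is preserved, the checkpoint still loads into
# standard inference stacks such as vLLM.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} MIDI tokens; vocab size is now {len(tokenizer)}")
```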
Related papers
- InstructAudio: Unified speech and music generation with natural language instruction [52.76518112649456]
InstructAudio is a unified framework that enables instruction-based control of acoustic attributes. It supports expressive speech, music, and dialogue generation in English and Chinese.
arXiv Detail & Related papers (2025-11-23T15:15:21Z)
- The GigaMIDI Dataset with Features for Expressive Music Performance Detection [5.585625844344932]
The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. This curated iteration of GigaMIDI encompasses expressively-performed instrument tracks detected by NOMML, constituting 31% of the GigaMIDI dataset.
arXiv Detail & Related papers (2025-02-24T23:39:40Z)
- Text2midi: Generating Symbolic Music from Captions [7.133321587053803]
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. We utilize a pretrained LLM encoder to process captions, which then conditions an autoregressive transformer decoder to produce MIDI sequences. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality. An illustrative sketch of this conditioning setup follows below.
arXiv Detail & Related papers (2024-12-21T08:09:12Z)
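To make text2midi's caption-conditioned decoding concrete, here is a minimal PyTorch sketch. A FLAN-T5 encoder stands in for the pretrained caption encoder; the decoder dimensions and the MIDI vocabulary size are assumptions, not the paper's configuration:

```python
# Minimal sketch: caption embeddings condition an autoregressive decoder
# via cross-attention (all sizes and the MIDI vocabulary are hypothetical).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

text_tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
text_enc = T5EncoderModel.from_pretrained("google/flan-t5-small")

MIDI_VOCAB, D_MODEL = 512, 512  # hypothetical; 512 matches flan-t5-small
embed = nn.Embedding(MIDI_VOCAB, D_MODEL)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
head = nn.Linear(D_MODEL, MIDI_VOCAB)

caption = "a calm piano piece in C major"
memory = text_enc(**text_tok(caption, return_tensors="pt")).last_hidden_state

midi_prefix = torch.randint(0, MIDI_VOCAB, (1, 16))  # stand-in MIDI token IDs
tgt = embed(midi_prefix)
causal = torch.triu(torch.full((16, 16), float("-inf")), diagonal=1)
logits = head(decoder(tgt, memory=memory, tgt_mask=causal))
next_token = logits[:, -1].argmax(-1)  # greedy choice of the next MIDI token
```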
- Annotation-Free MIDI-to-Audio Synthesis via Concatenative Synthesis and Generative Refinement [0.0]
CoSaRef is a MIDI-to-audio synthesis method that does not require MIDI-audio paired datasets. It generates a synthetic audio track using concatenative synthesis based on MIDI input, then refines it with a diffusion-based deep generative model trained on datasets without MIDI annotations. It allows detailed control over timbres and expression through audio sample selection and extra MIDI design, similar to traditional functions in digital audio workstations. A toy sketch of the concatenative stage follows below.
arXiv Detail & Related papers (2024-10-22T08:01:40Z)
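As a toy illustration of CoSaRef's two-stage idea, the sketch below renders MIDI notes by placing one-shot samples at their onsets; the sample bank is synthetic, and the diffusion-based refinement stage is only indicated in a comment:

```python
# Toy sketch of the concatenative first stage: place a one-shot sample at
# each MIDI note's onset. The sample bank below is synthetic; the paper's
# second stage (diffusion-based refinement) is only indicated at the end.
import numpy as np

SR = 44100  # sample rate

def concatenative_render(notes, sample_bank):
    """notes: list of (onset_sec, dur_sec, midi_pitch) tuples."""
    end = max(onset + dur for onset, dur, _ in notes)
    out = np.zeros(int(end * SR) + SR)
    for onset, dur, pitch in notes:
        clip = sample_bank[pitch][: int(dur * SR)]
        start = int(onset * SR)
        out[start:start + len(clip)] += clip  # overlay the note's sample
    return out

# Sine tones stand in for a real sample library (pitch 69 = A4 = 440 Hz).
t = np.arange(SR) / SR
bank = {p: 0.2 * np.sin(2 * np.pi * 440 * 2 ** ((p - 69) / 12) * t)
        for p in range(128)}
rough = concatenative_render([(0.0, 0.5, 60), (0.5, 0.5, 64)], bank)
# CoSaRef's stage 2 would refine `rough` with a diffusion model trained on
# audio without MIDI annotations; that model is out of scope for this sketch.
```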
- REFFLY: Melody-Constrained Lyrics Editing Model [50.03960548399128]
This paper introduces REFFLY, the first revision framework for editing and generating melody-aligned lyrics. We train the lyric revision module using our synthesized melody-aligned lyrics dataset. To further enhance the revision ability, we propose training-free heuristics aimed at preserving both semantic meaning and musical consistency.
arXiv Detail & Related papers (2024-08-30T23:22:34Z)
- Accompanied Singing Voice Synthesis with Fully Text-controlled Melody [61.147446955297625]
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices.
We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies.
arXiv Detail & Related papers (2024-07-02T08:23:38Z)
- MidiCaps: A large-scale MIDI dataset with text captions [6.806050368211496]
This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions.
Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions.
arXiv Detail & Related papers (2024-06-04T12:21:55Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. A sketch of the embedding-extraction role follows below.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
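The text-embedding role described for Noise2Music can be sketched as follows, using a T5 encoder as a stand-in text LM; the mean pooling and the way the conditioning enters the diffusion model are assumptions, not the paper's exact setup:

```python
# Sketch: extracting LM text embeddings as a conditioning signal
# (illustration of the role described above, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
enc = T5EncoderModel.from_pretrained("google/flan-t5-small")

prompt = "upbeat 90s rock with electric guitar, fast tempo"
with torch.no_grad():
    hidden = enc(**tok(prompt, return_tensors="pt")).last_hidden_state
cond = hidden.mean(dim=1)  # one pooled vector per prompt
# `cond` (and/or the full token-level `hidden`) would be fed to the
# diffusion model, e.g. via cross-attention at each denoising step.
print(cond.shape)  # torch.Size([1, 512])
```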
- Composer's Assistant: An Interactive Transformer for Multi-Track MIDI Infilling [0.0]
Composer's Assistant is a system for interactive human-computer composition in the REAPER digital audio workstation.
We train a T5-like model to accomplish the task of multi-track MIDI infilling.
Composer's Assistant consists of this model together with scripts that enable interaction with the model in REAPER. A T5-style infilling sketch follows below.
arXiv Detail & Related papers (2023-01-29T19:45:10Z)
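The abstract does not give the system's token format, but T5-style span infilling, which the described task resembles, can be sketched directly; the note-token prompt below is a hypothetical stand-in for the real multi-track MIDI vocabulary:

```python
# Sketch: T5-style infilling as a stand-in for multi-track MIDI infilling
# (the note-token prompt is hypothetical; a real system trains on its own
# MIDI vocabulary rather than a natural-language checkpoint).
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Sentinel tokens (<extra_id_*>) mark the spans the model must fill in,
# analogous to empty bars in selected tracks of a MIDI project.
prompt = "track1: C4 E4 G4 <extra_id_0> C5 ; track2: C2 <extra_id_1> G2"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=24)
print(tok.decode(out[0], skip_special_tokens=False))
```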
- PopMAG: Pop Music Accompaniment Generation [190.09996798215738]
We propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence.
MuMIDI enlarges the sequence length and brings the new challenge of long-term music modeling.
We call our system for pop music accompaniment generation PopMAG. A toy sketch of the single-sequence encoding follows below.
arXiv Detail & Related papers (2020-08-18T02:28:36Z)
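A toy sketch of folding multiple tracks into one token sequence, in the spirit of MuMIDI; the field names and ordering here are simplified guesses, not the paper's exact scheme:

```python
# Sketch: flatten multi-track notes into one sequence, MuMIDI-style
# (simplified guess at the scheme; the real representation also compresses
# note attributes to shorten the sequence).
def to_single_sequence(tracks):
    """tracks: {name: [(bar, pos, pitch, dur), ...]} -> one token list."""
    events = []
    for track, notes in tracks.items():
        for bar, pos, pitch, dur in notes:
            events.append((bar, pos, track, pitch, dur))
    events.sort()  # interleave all tracks in time order
    seq, cur_bar = [], None
    for bar, pos, track, pitch, dur in events:
        if bar != cur_bar:  # emit a bar marker once per bar
            seq.append(f"<bar_{bar}>")
            cur_bar = bar
        seq += [f"<pos_{pos}>", f"<trk_{track}>",
                f"<pitch_{pitch}>", f"<dur_{dur}>"]
    return seq

print(to_single_sequence({"melody": [(0, 0, 72, 4)], "bass": [(0, 0, 36, 8)]}))
```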
- Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip about people playing musical instruments.
We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings.
We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.