miditok: A Python package for MIDI file tokenization
- URL: http://arxiv.org/abs/2310.17202v1
- Date: Thu, 26 Oct 2023 07:37:44 GMT
- Title: miditok: A Python package for MIDI file tokenization
- Authors: Nathan Fradet, Jean-Pierre Briot, Fabien Chhel, Amal El Fallah
Seghrouchni, Nicolas Gutowski
- Abstract summary: MidiTok is an open-source library for tokenizing symbolic music.
It features the most popular music tokenizations under a unified API.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent progress in natural language processing has been adapted to the
symbolic music modality. Language models, such as Transformers, have been used
with symbolic music for a variety of tasks, including music generation, modeling,
and transcription, with state-of-the-art performance. These models are beginning
to be used in production applications. To encode and decode music for the
backbone model, they rely on tokenizers, whose role is to serialize music into
sequences of distinct elements called tokens. MidiTok is an open-source library
for tokenizing symbolic music with great flexibility and extended features. It
features the most popular music tokenizations under a unified API. It is designed
to be easy to use and extensible for everyone.
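In practice, a tokenize/detokenize round trip with the library looks roughly like the sketch below. It follows MidiTok's v2-era API (REMI, TokenizerConfig, and tokens_to_midi are real names, but signatures have shifted across releases), so treat it as illustrative rather than definitive.

```python
# Minimal sketch of a tokenize/detokenize round trip with MidiTok.
# Method names follow the v2.x API; they have shifted across releases
# (e.g. decoding is tokenizer.decode() in later versions).
from miditok import REMI, TokenizerConfig
from miditoolkit import MidiFile

# All tokenizations (REMI, TSD, MIDI-Like, ...) share this configuration object.
config = TokenizerConfig(num_velocities=16, use_chords=True, use_tempos=True)
tokenizer = REMI(config)

midi = MidiFile("song.mid")                 # hypothetical input file
tokens = tokenizer(midi)                    # serialize notes into token sequences
decoded = tokenizer.tokens_to_midi(tokens)  # deserialize back to a MIDI object
decoded.dump("song_decoded.mid")
```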
Related papers
- TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument [19.395289629201056]
TokenSynth is a novel neural synthesizer that generates audio tokens from MIDI tokens and CLAP embeddings.
Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation.
arXiv Detail & Related papers (2025-02-13T03:40:30Z)
- Text2midi: Generating Symbolic Music from Captions [7.133321587053803]
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions.
We utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences.
We conduct comprehensive empirical evaluations, incorporating both automated and human studies, that show our model generates MIDI files of high quality.
arXiv Detail & Related papers (2024-12-21T08:09:12Z)
- Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation [2.668651175000492]
Representing symbolic music with compound tokens, where each token consists of several different sub-tokens, offers the advantage of reducing sequence length.
We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage.
Experiment results showed that applying the NMT to compound tokens can enhance the performance in terms of better perplexity in processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset.
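As a generic illustration of the sequence-length trade-off described above (made-up tokens, not the NMT paper's vocabulary or code):

```python
# Illustrative only: compound vs. flattened token sequences for three notes.
# Each compound token groups one note's sub-tokens into a single decoding step.
notes = [
    ("Pitch_60", "Vel_80", "Dur_4"),
    ("Pitch_64", "Vel_80", "Dur_4"),
    ("Pitch_67", "Vel_96", "Dur_8"),
]

flattened = [sub for note in notes for sub in note]  # one sub-token per step
compound = list(notes)                               # one note per step

print(len(flattened))  # 9 decoding steps
print(len(compound))   # 3 decoding steps, each emitting 3 sub-tokens
```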
arXiv Detail & Related papers (2024-08-02T11:02:38Z)
- MidiCaps: A large-scale MIDI dataset with text captions [6.806050368211496]
This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions.
Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions.
arXiv Detail & Related papers (2024-06-04T12:21:55Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
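For intuition, here is a sketch of a "delay"-style interleaving pattern over parallel token streams; it shows the general idea of token interleaving, not MusicGen's exact patterns:

```python
# Sketch of a "delay" interleaving pattern over K parallel token streams.
# PAD and the layout are illustrative; MusicGen's actual patterns differ in detail.
PAD = -1

def delay_interleave(streams):
    """Offset stream k by k steps so one LM step covers all codebooks."""
    k = len(streams)
    t = len(streams[0])
    out = []
    for step in range(t + k - 1):
        frame = [
            streams[cb][step - cb] if 0 <= step - cb < t else PAD
            for cb in range(k)
        ]
        out.append(frame)
    return out

# Two codebook streams of four tokens each:
print(delay_interleave([[1, 2, 3, 4], [5, 6, 7, 8]]))
# [[1, -1], [2, 5], [3, 6], [4, 7], [-1, 8]]
```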
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
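A rough way to picture this layout (hypothetical tokens, not the paper's code):

```python
# Illustrative 2D layout in the spirit of GETScore: one row per track,
# one column per time step. Token values here are made up for the example.
PAD = "<pad>"

score = [
    # t=0        t=1        t=2        t=3
    ["Pitch_60", PAD,       "Pitch_62", PAD      ],  # melody track
    ["Chord_C",  PAD,       "Chord_G",  PAD      ],  # harmony track
    [PAD,        "Drum_36", PAD,        "Drum_38"],  # drum track
]

# Any source-target combination amounts to choosing which rows to condition on
# and which to predict, e.g. generate the drum row given melody and harmony:
source_rows, target_rows = [0, 1], [2]
```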
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
- Byte Pair Encoding for Symbolic Music [0.0]
Byte Pair Encoding significantly decreases the sequence length while increasing the vocabulary size.
We leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks.
The source code is shared on Github, along with a companion website.
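For intuition, a single BPE merge step over symbolic-music tokens can be sketched generically as follows (illustrative only; the paper's implementation has its own vocabulary handling):

```python
from collections import Counter

# Generic BPE sketch over symbolic-music tokens: repeatedly merge the most
# frequent adjacent pair into a new token, shrinking sequences while growing
# the vocabulary. Token names are hypothetical.
def bpe_merge_once(seq):
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            merged.append(f"{a}+{b}")  # new vocabulary entry
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

seq = ["Pitch_60", "Dur_4", "Pitch_62", "Dur_4", "Pitch_60", "Dur_4"]
print(bpe_merge_once(seq))
# ['Pitch_60+Dur_4', 'Pitch_62', 'Dur_4', 'Pitch_60+Dur_4']
```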
arXiv Detail & Related papers (2023-01-27T20:22:18Z)
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that our proposed approach can generate coherent, novel, complex, and harmonious symphonies comparable to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training [97.91071692716406]
Symbolic music understanding refers to understanding music from symbolic data.
MusicBERT is a large-scale pre-trained model for music understanding.
arXiv Detail & Related papers (2021-06-10T10:13:05Z)
- Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip of people playing musical instruments.
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.
We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z)