Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation
- URL: http://arxiv.org/abs/2603.00576v1
- Date: Sat, 28 Feb 2026 09:54:02 GMT
- Title: Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation
- Authors: Jinhan Xu, Xing Tang, Houpeng Yang, Haoran Zhang, Shenghua Yuan, Jiatao Chen, Tianming Xi, Jing Wang, Jiaojiao Yu, Guangli Xiang
- Abstract summary: Symbolic music generation is a challenging task, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. We propose a diffusing strategy named SMDIM that combines efficient global structure construction with lightweight local refinement. Experiments show that the model outperforms other state-of-the-art approaches in both generation quality and computational efficiency.
- Score: 5.290828305368797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Although recent diffusion-based models produce high-quality generations, they tend to suffer from high training and inference costs on long symbolic sequences because of iterative denoising and costs that grow with sequence length. To address this problem, we propose a diffusing strategy named SMDIM that combines efficient global structure construction with lightweight local refinement. SMDIM uses structured state space models to capture long-range musical context at near-linear cost, and selectively refines local musical details via a hybrid refinement scheme. Experiments on a wide range of symbolic music datasets, covering Western classical music, popular music, and traditional folk music, show that SMDIM outperforms other state-of-the-art approaches in both generation quality and computational efficiency, and that it generalizes robustly to underexplored musical styles. These results show that SMDIM offers a principled solution for long-sequence symbolic music generation, including associated attributes that accompany the sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim-music/.
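The abstract's claim that state space models capture long-range context at near-linear cost can be illustrated with a generic diagonal linear recurrence. The following is a minimal sketch, not the SMDIM architecture: the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration. Each step updates a fixed-size state, so the total work is proportional to sequence length times state size, whereas full self-attention grows quadratically with sequence length.

```python
from typing import List, Sequence


def ssm_scan(
    inputs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state space recurrence over a 1-D sequence.

    Hypothetical illustration (not taken from the paper):
        h_t = a * h_{t-1} + b * x_t    (elementwise over the state)
        y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)  # fixed-size hidden state, independent of sequence length
    outputs: List[float] = []
    for x in inputs:
        # One O(N) update per time step, where N is the state size.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# An impulse decays geometrically, so early inputs still influence later outputs.
print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))  # [1.0, 0.5, 0.25]
```

Practical structured state space layers use learned, often input-dependent parameters and parallel scan implementations; the loop above only shows why the cost scales linearly with the number of time steps.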
Related papers
- Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control [66.46754271097555]
We release a fully open-source system for long-form song generation with fine-grained style conditioning.
The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions.
We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens.
arXiv Detail & Related papers (2026-01-07T14:40:48Z)
- Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions.
We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
- Extending Visual Dynamics for Video-to-Music Generation [51.274561293909926]
DyViM is a novel framework to enhance dynamics modeling for video-to-music generation.
High-level semantics are conveyed through a cross-attention mechanism.
Experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-04-10T09:47:26Z)
- PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation [4.101665207455494]
PerceiverS (Segmentation and Scale) is a novel architecture designed to generate long-structured and expressive music.
Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details.
The proposed model has been evaluated using the Maestro dataset and has demonstrated improvements in generating coherent and diverse music.
arXiv Detail & Related papers (2024-11-13T03:14:10Z)
- SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation [7.428668206443388]
We introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic training set.
We demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset w.r.t. the well-known EnsembleSet.
arXiv Detail & Related papers (2024-09-17T08:58:33Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation [50.365392018302416]
We propose Multi-view MidiVAE, as one of the pioneers in VAE methods that effectively model and generate long multi-track symbolic music.
We focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy.
arXiv Detail & Related papers (2024-01-15T08:41:01Z)
- Hierarchical Recurrent Neural Networks for Conditional Melody Generation with Long-term Structure [0.0]
We propose a conditional melody generation model based on a hierarchical recurrent neural network (CM-HRNN).
This model generates melodies with long-term structures based on given chord accompaniments.
Results from our listening test indicate that CM-HRNN outperforms AttentionRNN in terms of long-term structure and overall rating.
arXiv Detail & Related papers (2021-02-19T08:22:26Z)
- PopMAG: Pop Music Accompaniment Generation [190.09996798215738]
We propose a novel MUlti-track MIDI representation (MuMIDI) which enables simultaneous multi-track generation in a single sequence.
MuMIDI enlarges the sequence length and brings the new challenge of long-term music modeling.
We call our pop music accompaniment generation system PopMAG.
arXiv Detail & Related papers (2020-08-18T02:28:36Z)
- Modeling Musical Structure with Artificial Neural Networks [0.0]
I explore the application of artificial neural networks to different aspects of musical structure modeling.
I show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments.
I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals.
arXiv Detail & Related papers (2020-01-06T18:35:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.