Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation
- URL: http://arxiv.org/abs/2505.03314v1
- Date: Tue, 06 May 2025 08:44:52 GMT
- Title: Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation
- Authors: Jincheng Zhang, György Fazekas, Charalampos Saitis
- Abstract summary: We represent symbolic music as image-like pianorolls, facilitating the use of diffusion models for the generation of symbolic music. This study introduces a novel diffusion model that incorporates our proposed Transformer-Mamba block and learnable wavelet transform. Our evaluation shows that our method achieves compelling results in terms of music quality and controllability.
- Score: 5.083504224028769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent surge in the popularity of diffusion models for image synthesis has attracted new attention to their potential for generation tasks in other domains. However, their applications to symbolic music generation remain largely under-explored because symbolic music is typically represented as sequences of discrete events and standard diffusion models are not well-suited for discrete data. We represent symbolic music as image-like pianorolls, facilitating the use of diffusion models for the generation of symbolic music. Moreover, this study introduces a novel diffusion model that incorporates our proposed Transformer-Mamba block and learnable wavelet transform. Classifier-free guidance is utilised to generate symbolic music with target chords. Our evaluation shows that our method achieves compelling results in terms of music quality and controllability, outperforming the strong baseline in pianoroll generation. Our code is available at https://github.com/jinchengzhanggg/proffusion.
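The abstract highlights two ingredients that are easy to make concrete: rasterising symbolic music into an image-like pianoroll, and steering the diffusion sampler toward a target chord with classifier-free guidance. The sketch below is a minimal illustration of both ideas; the function names, grid resolution, and `model` interface are assumptions for illustration and are not taken from the authors' repository.

```python
import numpy as np

def notes_to_pianoroll(notes, num_pitches=128, steps_per_beat=4, total_beats=16):
    """Rasterise (pitch, start_beat, duration_beats) tuples into a binary
    pianoroll of shape (num_pitches, time_steps)."""
    time_steps = total_beats * steps_per_beat
    roll = np.zeros((num_pitches, time_steps), dtype=np.float32)
    for pitch, start, duration in notes:
        onset = int(round(start * steps_per_beat))
        offset = int(round((start + duration) * steps_per_beat))
        roll[pitch, onset:max(offset, onset + 1)] = 1.0
    return roll

def cfg_noise_estimate(model, x_t, t, chord_condition, null_condition, guidance_scale=3.0):
    """Classifier-free guidance: blend conditional and unconditional noise
    predictions; a larger guidance_scale pushes samples toward the chord condition.
    `model` is a placeholder denoiser callable, not the paper's architecture."""
    eps_cond = model(x_t, t, chord_condition)
    eps_uncond = model(x_t, t, null_condition)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Example: a C-major arpeggio rasterised into a 128 x 64 pianoroll.
example_roll = notes_to_pianoroll([(60, 0, 1), (64, 1, 1), (67, 2, 1), (72, 3, 1)])
print(example_roll.shape)  # (128, 64)
```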
Related papers
- Scaling Self-Supervised Representation Learning for Symbolic Piano Performance [52.661197827466886]
We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. We use a comparatively smaller, high-quality subset to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings.
arXiv Detail & Related papers (2025-06-30T14:00:14Z)
- MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
- Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation [14.156461396686248]
We introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models. FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers. This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation.
arXiv Detail & Related papers (2024-10-11T00:41:46Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion [32.961767438163676]
Musical rules are often expressed symbolically in terms of note characteristics, such as note density or chord progression.
We propose Stochastic Control Guidance (SCG), a novel guidance method that requires only forward evaluation of rule functions.
SCG achieves training-free guidance for non-differentiable rules for the first time.
arXiv Detail & Related papers (2024-02-22T04:55:58Z)
- Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models [5.083504224028769]
We propose to combine a vector quantized variational autoencoder (VQ-VAE) and discrete diffusion models for the generation of symbolic music.
The trained VQ-VAE can represent symbolic music as a sequence of indexes that correspond to specific entries in a learned codebook.
The diffusion model is trained to generate intermediate music sequences consisting of codebook indexes, which are then decoded to symbolic music using the VQ-VAE's decoder; a minimal sketch of this index-based pipeline appears after the list below.
arXiv Detail & Related papers (2023-10-21T15:41:50Z)
- Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions [1.6004393678882072]
We propose a diffusion model combined with a Generative Adversarial Network to generate discrete symbolic music.
We first used a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels.
Our results demonstrate the successful control of our diffusion model to generate symbolic music with a desired emotion.
arXiv Detail & Related papers (2023-10-21T15:35:43Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model.
We propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model context at the global and section levels.
Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
arXiv Detail & Related papers (2022-05-17T18:48:14Z)
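For the closely related "Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models" entry above, the index-based pipeline can be sketched as follows. This is a toy illustration under assumed shapes: a real system would learn the codebook and decoder, and the discrete diffusion model (replaced here by random indices) would supply the generated index sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned codebook: K entries of dimension D (shapes assumed).
K, D = 512, 64
codebook = rng.normal(size=(K, D)).astype(np.float32)

def encode_to_indices(latents):
    """VQ step: map each encoder latent to the index of its nearest codebook entry."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def decode_indices(indices):
    """Look up codebook vectors for an index sequence; a full VQ-VAE would pass
    these through its decoder to recover a pianoroll or symbolic score."""
    return codebook[indices]

# A discrete diffusion model would generate the index sequence; random stand-in here.
generated_indices = rng.integers(0, K, size=32)
decoded_latents = decode_indices(generated_indices)
print(decoded_latents.shape)  # (32, 64)
```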