Related papers: Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation

Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation

URL: http://arxiv.org/abs/2410.08435v2
Date: Sun, 02 Feb 2025 19:01:07 GMT
Title: Efficient Fine-Grained Guidance for Diffusion-Based Symbolic Music Generation
Authors: Tingyu Zhu, Haoyu Liu, Ziyu Wang, Zhimin Jiang, Zeyu Zheng,
Abstract summary: We introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models.<n>FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers.<n>This approach empowers diffusion models to excel in advanced applications such as improvisation, and interactive music creation.
Score: 14.156461396686248
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Developing generative models to create or conditionally create symbolic music presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To address these challenges, we introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models. FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers, which is critical to improve the accuracy, listenability, and quality of generated music. This approach empowers diffusion models to excel in advanced applications such as improvisation, and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effects of the FGG approach. We provide numerical experiments and subjective evaluation to demonstrate the effectiveness of our approach. We have published a demo page to showcase performances, as one of the first in the symbolic music literature's demo pages that enables real-time interactive generation.

Related papers

EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models.<n>Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z)
MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation [10.203209816178552]
MotionRAG-Diff is a hybrid framework that integrates Retrieval-Augmented Generation and diffusion-based refinement.<n>Our method introduces three core innovations.<n>It achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy.
arXiv Detail & Related papers (2025-06-03T09:12:48Z)
QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation [46.301388755267986]
We propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy. We first adapted and implemented a masked diffusion transformer (MDT) model for the TTM task, demonstrating its capacity for quality control and enhanced musicality. Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer dataset.
arXiv Detail & Related papers (2024-05-24T18:09:27Z)
Music Consistency Models [31.415900049111023]
We present Music Consistency Models (textttMusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrogram for music clips. Building upon existing text-to-music diffusion models, the textttMusicCM model incorporates consistency distillation and adversarial discriminator training. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness.
arXiv Detail & Related papers (2024-04-20T11:52:30Z)
MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music. To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation) Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models [24.582948932985726]
This paper introduces a novel approach to the editing of music generated by text-to-music models. Our method transforms text editing to textitlatent space manipulation while adding an extra constraint to enforce consistency. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations.
arXiv Detail & Related papers (2024-02-09T04:34:08Z)
DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a frame-work for controlling pre-trained text-to-music diffusion models at inference-time. We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions [1.6004393678882072]
We propose a diffusion model combined with a Generative Adversarial Network to generate discrete symbolic music. We first used a trained Variational Autoencoder to obtain embeddings of a symbolic music dataset with emotion labels. Our results demonstrate the successful control of our diffusion model to generate symbolic music with a desired emotion.
arXiv Detail & Related papers (2023-10-21T15:35:43Z)
Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment. Performance conditioning is a tool indicating the generative model to synthesize music with style and timbre of specific instruments taken from specific performances. Our prototype is evaluated using uncurated performances with diverse instrumentation and state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z)
Taming Diffusion Models for Music-driven Conducting Motion Generation [1.0624606551524207]
This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation. We propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations. We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive evaluation of the generated motion.
arXiv Detail & Related papers (2023-06-15T03:49:24Z)
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model, underpinned by the utilization of diffusion models. Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process. We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
Generating music with sentiment using Transformer-GANs [0.0]
We propose a generative model of symbolic music conditioned by data retrieved from human sentiment. We try to tackle both of the problems above by employing an efficient linear version of Attention and using a Discriminator.
arXiv Detail & Related papers (2022-12-21T15:59:35Z)
Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions. We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation. Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
Music-to-Dance Generation with Optimal Transport [48.92483627635586]
We propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographs from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music.
arXiv Detail & Related papers (2021-12-03T09:37:26Z)
Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural mean to imbue variational autoencoders with context information. We introduce the first Music Adversarial Autoencoder (MusAE) Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.