ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement
- URL: http://arxiv.org/abs/2502.04522v4
- Date: Fri, 16 May 2025 14:56:53 GMT
- Title: ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement
- Authors: Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R. Enus, Louis B. Bradshaw, Dorien Herremans, Simon Colton
- Abstract summary: ImprovNet is a transformer-based architecture that generates expressive and controllable musical improvisations. It can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks.
- Score: 6.873190001575463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. The improvisational style transfer aims to make meaningful modifications to one or more musical elements (melody, harmony, or rhythm) of the original composition with respect to the target genre. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms the Anticipatory Music Transformer on short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79% of participants correctly identifying jazz-style improvisations of classical pieces. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
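Below is a minimal, hypothetical sketch of the kind of iterative corruption-refinement loop the abstract describes, where repeated corruption and genre-conditioned refinement gradually move the piece toward the target style while the number of iterations and the corruption rate control how much of the original structure survives. The `corrupt` helper, the `model.refine` call, and all parameter names are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of an iterative corruption-refinement generation loop
# in the spirit of ImprovNet. Function and method names are assumptions.
import random

def corrupt(tokens, rate, rng):
    """Mask a random subset of symbolic-music tokens at the given corruption rate."""
    return [tok if rng.random() > rate else "<MASK>" for tok in tokens]

def iterative_improvise(model, original_tokens, target_genre,
                        num_iterations=5, corruption_rate=0.3, seed=0):
    """Repeatedly corrupt and refine the piece. More iterations or a higher
    corruption rate pushes the result further toward the target genre;
    fewer iterations preserve more of the original composition."""
    rng = random.Random(seed)
    tokens = list(original_tokens)
    for _ in range(num_iterations):
        corrupted = corrupt(tokens, corruption_rate, rng)
        # `model.refine` stands in for a genre-conditioned refinement pass.
        tokens = model.refine(corrupted, genre=target_genre)
    return tokens
```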
Related papers
- From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training [4.7205815347741185]
We investigate how general music knowledge learned from a broad corpus can enhance the mastery of specific composer styles. First, we pre-train a REMI-based music generation model on a large corpus of pop, folk, and classical music. Then, we fine-tune it on a small, human-verified dataset from four renowned composers, namely Bach, Mozart, Beethoven, and Chopin.
arXiv Detail & Related papers (2025-06-20T22:20:59Z) - Adaptive Accompaniment with ReaLchords [60.690020661819055]
We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use.
arXiv Detail & Related papers (2025-06-17T16:59:05Z) - Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation [10.643965544581683]
We introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation.
MusiCoT empowers the AR model to first outline an overall music structure before generating audio tokens.
Our experimental results indicate that MusiCoT consistently achieves superior performance across both objective and subjective metrics.
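A rough sketch of the two-phase decoding idea described above (outline the overall musical structure first, then generate audio tokens conditioned on it). The `model.generate` signature, the token vocabularies, and the input being a token list are assumptions for illustration, not MusiCoT's actual API.

```python
# Hypothetical two-phase autoregressive decoding in the spirit of
# chain-of-musical-thought prompting. All names are illustrative assumptions.

def chain_of_musical_thought(model, prompt_tokens,
                             max_structure_len=64, max_audio_len=4096):
    # Phase 1: decode an analyzable structure outline (e.g. a section/phrase plan).
    structure = model.generate(prompt_tokens,
                               vocab="structure",
                               max_new_tokens=max_structure_len)
    # Phase 2: decode audio tokens conditioned on both the prompt and the outline.
    audio_tokens = model.generate(prompt_tokens + structure,
                                  vocab="audio",
                                  max_new_tokens=max_audio_len)
    return structure, audio_tokens
```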
arXiv Detail & Related papers (2025-03-25T12:51:21Z) - SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation.
It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately.
To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z) - UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z) - MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z) - JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation [18.979064278674276]
JEN-1 Composer is designed to efficiently model marginal, conditional, and joint distributions over multi-track music. We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks. Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis.
arXiv Detail & Related papers (2023-10-29T22:51:49Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
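As a concrete illustration of token interleaving over parallel codebook streams, here is a minimal sketch of a delay pattern, one of the interleaving schemes discussed in this line of work; the padding token, helper name, and exact offsets are assumptions for illustration.

```python
# Minimal sketch of a "delay" interleaving pattern over K codebook streams,
# letting a single-stage LM predict all streams step by step. Details are assumptions.
PAD = -1

def delay_interleave(codebooks):
    """codebooks: list of K equal-length token lists (one per codebook stream).
    Stream k is shifted right by k steps, so at decoding step t the model
    predicts codebook k's token for frame t - k."""
    K = len(codebooks)
    T = len(codebooks[0])
    steps = []
    for t in range(T + K - 1):
        frame = [codebooks[k][t - k] if 0 <= t - k < T else PAD for k in range(K)]
        steps.append(frame)
    return steps

# Example: 3 codebooks, 4 frames each.
print(delay_interleave([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]))
```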
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
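A small illustrative sketch of such a 2D track-by-time token layout, with tracks stacked as rows and time steps progressing as columns; the token names, padding symbol, and helper function are assumptions and do not reflect GETScore's actual vocabulary.

```python
# Illustrative GETScore-like layout: rows = tracks, columns = time steps.
# Token names and the padding symbol are assumptions.
EMPTY = "<pad>"

score = {
    "melody": ["C4",   "E4",  "G4",    EMPTY],
    "bass":   ["C2",   EMPTY, "G2",    "C2"],
    "drums":  ["kick", "hat", "snare", "hat"],
}

def to_grid(score, track_order):
    """Flatten the track dict into a 2D grid (rows = tracks, columns = time),
    so source tracks can be kept fixed while target tracks are generated."""
    return [score[name] for name in track_order]

for row in to_grid(score, ["melody", "bass", "drums"]):
    print(row)
```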
arXiv Detail & Related papers (2023-05-18T09:53:23Z) - Multi-Genre Music Transformer -- Composing Full Length Musical Piece [0.0]
The objective of the project is to implement a Multi-Genre Transformer which learns to produce music pieces through a more adaptive learning process.
We built a multi-genre compound-word dataset and implemented a linear transformer which was trained on this dataset.
We call this the Multi-Genre Transformer, which was able to generate full-length new musical pieces that are diverse and comparable to original tracks.
arXiv Detail & Related papers (2023-01-06T05:27:55Z) - Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach [36.49582705724548]
We devise a two-stage Transformer-based framework that Composes a lead sheet first, and then Embellishes it with accompaniment and expressive touches.
Our objective and subjective experiments show that Compose & Embellish shrinks the gap in structureness between a current state of the art and real performances by half, and improves other musical aspects such as richness and coherence as well.
arXiv Detail & Related papers (2022-09-17T01:20:59Z) - The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model.
We propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model context at the global and section levels.
Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
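A hypothetical sketch of coarse-to-fine generation in the spirit of this description, where a coarse decoder drafts section-level plans and a fine decoder expands each section into note tokens; the decoder objects and their `generate` signatures are illustrative assumptions.

```python
# Hypothetical coarse-to-fine decoding scheme. All names are assumptions.

def generate_piece(coarse_decoder, fine_decoder, prompt_tokens, num_sections=4):
    # Global pass: draft one coarse plan per section (structural reuse can be
    # expressed by repeating earlier section plans).
    section_plans = coarse_decoder.generate(prompt_tokens, num_sections=num_sections)
    piece = []
    for plan in section_plans:
        # Local pass: expand each section plan into note-level tokens,
        # conditioned on the plan and on what has been generated so far.
        notes = fine_decoder.generate(context=piece, section_plan=plan)
        piece.extend(notes)
    return piece
```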
arXiv Detail & Related papers (2022-05-17T18:48:14Z) - SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance [88.0355290619761]
This work focuses on the separation of unknown musical instruments.
We propose the Separation-with-Consistency (SeCo) framework, which can accomplish the separation on unknown categories.
Our framework exhibits strong adaptation ability on the novel musical categories and outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-03-25T09:42:11Z) - Controllable deep melody generation via hierarchical music structure representation [14.891975420982511]
MusicFrameworks is a hierarchical music structure representation and a multi-step generative process to create a full-length melody.
To generate melody in each phrase, we generate rhythm and basic melody using two separate transformer-based networks.
To customize or add variety, one can alter chords, basic melody, and rhythm structure in the music frameworks, letting our networks generate the melody accordingly.
arXiv Detail & Related papers (2021-09-02T01:31:14Z) - Learning Interpretable Representation for Controllable Polyphonic Music Generation [5.01266258109807]
We design a novel architecture that effectively learns two interpretable latent factors of polyphonic music: chord and texture.
We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications.
arXiv Detail & Related papers (2020-08-17T07:11:16Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)