ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement
- URL: http://arxiv.org/abs/2502.04522v1
- Date: Thu, 06 Feb 2025 21:45:38 GMT
- Title: ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement
- Authors: Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R. Enus, Louis B. Bradshaw, Dorien Herremans, Simon Colton
- Abstract summary: ImprovNet is a transformer-based architecture that generates expressive and controllable musical improvisations.
It can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks.
- Score: 6.873190001575463
- License:
- Abstract: Deep learning has enabled remarkable advances in style transfer across various domains, offering new possibilities for creative content generation. However, in the realm of symbolic music, generating controllable and expressive performance-level style transfers for complete musical works remains challenging due to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79% of participants correctly identifying jazz-style improvisations. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
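The abstract describes an iterative corruption-refinement generation loop in which the degree of style transfer and the structural deviation from the original piece are governed by how often and how heavily the sequence is corrupted and re-refined. The sketch below illustrates that idea in Python as a conceptual outline only; every identifier (`corrupt_span`, `refine_fn`, the masking scheme, the parameter names) is a hypothetical stand-in rather than ImprovNet's actual tokenization or model interface.

```python
# Conceptual sketch of an iterative corruption-refinement generation loop.
# All names and the corruption scheme are illustrative assumptions; they do
# not reproduce ImprovNet's actual implementation.
import random
from typing import Callable, List

Tokens = List[int]
MASK = -1  # placeholder id for corrupted positions (assumption)

def corrupt_span(tokens: Tokens, span_frac: float, rng: random.Random) -> Tokens:
    """Mask one contiguous span of the sequence (one possible corruption)."""
    n = len(tokens)
    span = max(1, int(n * span_frac))
    start = rng.randrange(0, max(1, n - span))
    out = list(tokens)
    out[start:start + span] = [MASK] * span
    return out

def iterative_improvise(
    original: Tokens,
    refine_fn: Callable[[Tokens, str], Tokens],  # model call: corrupted tokens + target genre -> refined tokens
    target_genre: str = "jazz",
    num_iterations: int = 10,   # more iterations -> stronger style transfer
    span_frac: float = 0.2,     # larger corrupted spans -> more structural deviation
    seed: int = 0,
) -> Tokens:
    rng = random.Random(seed)
    current = list(original)
    for _ in range(num_iterations):
        corrupted = corrupt_span(current, span_frac, rng)
        current = refine_fn(corrupted, target_genre)  # the model "repairs" the corruption in the target style
    return current

if __name__ == "__main__":
    # Dummy refiner that simply fills masked positions with 0, to show the loop runs.
    identity_refiner = lambda seq, genre: [t if t != MASK else 0 for t in seq]
    print(iterative_improvise(list(range(32)), identity_refiner, num_iterations=3))
```

In this sketch, raising `num_iterations` or `span_frac` pushes the output further toward the target genre and further from the original structure, mirroring the user controls described in the abstract.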
Related papers
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation.
It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately.
To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z)
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into a token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation [18.979064278674276]
JEN-1 Composer is designed to efficiently model marginal, conditional, and joint distributions over multi-track music.
We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks.
Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis.
arXiv Detail & Related papers (2023-10-29T22:51:49Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns; a minimal sketch of one such interleaving appears after this list.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- Multi-Genre Music Transformer -- Composing Full Length Musical Piece [0.0]
The objective of this project is to implement a Multi-Genre Transformer that learns to produce music pieces through a more adaptive learning process.
We built a multi-genre compound-word dataset and trained a linear transformer on it.
The resulting Multi-Genre Transformer was able to generate full-length new musical pieces that are diverse and comparable to original tracks.
arXiv Detail & Related papers (2023-01-06T05:27:55Z)
- The Power of Reuse: A Multi-Scale Transformer Model for Structural Dynamic Segmentation in Symbolic Music Generation [6.0949335132843965]
Symbolic Music Generation relies on the contextual representation capabilities of the generative model.
We propose a multi-scale Transformer, which uses a coarse decoder and fine decoders to model contexts at the global and section levels.
Our model is evaluated on two open MIDI datasets, and experiments show that our model outperforms the best contemporary symbolic music generative models.
arXiv Detail & Related papers (2022-05-17T18:48:14Z)
- SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance [88.0355290619761]
This work focuses on the separation of unknown musical instruments.
We propose the Separation-with-Consistency (SeCo) framework, which can accomplish separation on unknown categories.
Our framework exhibits strong adaptation ability on novel musical categories and outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-03-25T09:42:11Z)
- Learning Interpretable Representation for Controllable Polyphonic Music Generation [5.01266258109807]
We design a novel architecture that effectively learns two interpretable latent factors of polyphonic music: chord and texture.
We show that such chord-texture disentanglement provides a controllable generation pathway leading to a wide spectrum of applications.
arXiv Detail & Related papers (2020-08-17T07:11:16Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
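The MusicGen entry above mentions a single-stage transformer LM operating over several token streams with efficient token interleaving patterns. The sketch below shows one way such an interleaving can be arranged, assuming a delay-style pattern in which codebook k is shifted right by k steps; the padding id, shapes, and function name are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a "delay"-style interleaving of multi-codebook token
# streams. Illustrative only; the padding id and layout are assumptions.
from typing import List

PAD = -1  # special id marking positions with no token yet (assumption)

def delay_interleave(streams: List[List[int]]) -> List[List[int]]:
    """Shift codebook k right by k steps so that, at any step, the model
    never has to predict same-step tokens from later codebooks."""
    k = len(streams)
    out = []
    for i, stream in enumerate(streams):
        row = [PAD] * i + list(stream) + [PAD] * (k - 1 - i)
        out.append(row)
    return out  # shape: k x (t + k - 1)

if __name__ == "__main__":
    codebooks = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
    for row in delay_interleave(codebooks):
        print(row)
    # [11, 12, 13, -1, -1]
    # [-1, 21, 22, 23, -1]
    # [-1, -1, 31, 32, 33]
```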