InstructME: An Instruction Guided Music Edit And Remix Framework with
Latent Diffusion Models
- URL: http://arxiv.org/abs/2308.14360v3
- Date: Tue, 12 Dec 2023 06:55:08 GMT
- Title: InstructME: An Instruction Guided Music Edit And Remix Framework with
Latent Diffusion Models
- Authors: Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen,
Yuxuan Wang, Yanmin Qian and Xuchen Song
- Abstract summary: In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models.
Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing.
Our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony.
- Score: 42.2977676825086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music editing primarily entails the modification of instrument tracks or
remixing as a whole, which offers a novel reinterpretation of the original
piece through a series of operations. These music processing methods hold
immense potential across various applications but demand substantial expertise.
Prior methodologies, although effective for image and audio modifications,
falter when directly applied to music. This is attributed to music's
distinctive data nature, where such methods can inadvertently compromise the
intrinsic harmony and coherence of music. In this paper, we develop InstructME,
an Instruction guided Music Editing and remixing framework based on latent
diffusion models. Our framework fortifies the U-Net with multi-scale
aggregation in order to maintain consistency before and after editing. In
addition, we introduce a chord progression matrix as condition information and
incorporate it in the semantic space to improve melodic harmony while editing.
For accommodating extended musical pieces, InstructME employs a chunk
transformer, enabling it to discern long-term temporal dependencies within
music sequences. We tested InstructME in instrument-editing, remixing, and
multi-round editing. Both subjective and objective evaluations indicate that
our proposed method significantly surpasses preceding systems in music quality,
text relevance and harmony. Demo samples are available at
https://musicedit.github.io/
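The abstract does not specify how the chord progression matrix is encoded. As a rough illustration only, the sketch below assumes one plausible layout: a binary matrix with one row per time frame and one column per pitch class. The names `chord_to_chroma`, `chord_progression_matrix`, and `frames_per_chord` are hypothetical and are not taken from the paper.

```python
# Hypothetical illustration; this is NOT the paper's actual encoding.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Intervals (in semitones above the root) for two common chord qualities.
CHORD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}


def chord_to_chroma(root, quality):
    """Return a 12-dimensional binary pitch-class vector for one chord."""
    root_index = PITCH_CLASSES.index(root)
    chroma = [0] * 12
    for interval in CHORD_INTERVALS[quality]:
        chroma[(root_index + interval) % 12] = 1
    return chroma


def chord_progression_matrix(progression, frames_per_chord):
    """Repeat each chord's chroma row for every frame it spans.

    The result has len(progression) * frames_per_chord rows and 12 columns.
    """
    matrix = []
    for root, quality in progression:
        row = chord_to_chroma(root, quality)
        matrix.extend(list(row) for _ in range(frames_per_chord))
    return matrix


# A common four-chord progression in C major, two frames per chord (assumed).
progression = [("C", "maj"), ("A", "min"), ("F", "maj"), ("G", "maj")]
matrix = chord_progression_matrix(progression, frames_per_chord=2)
print(len(matrix), len(matrix[0]))
```

A matrix like this could be passed to a model as a conditioning signal alongside the text instruction; how InstructME actually incorporates it in the semantic space is described in the paper itself, not here.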
Related papers
- Melody Is All You Need For Music Generation [10.366088659024685]
We present the Melody Guided Music Generation (MMGen) model, the first approach that uses melody to guide music generation.
Specifically, we first align the melody with audio waveforms and their associated descriptions using the multimodal alignment module.
This allows MMGen to generate music that matches the style of the provided audio while also producing music that reflects the content of the given text description.
arXiv Detail & Related papers (2024-09-30T11:13:35Z)
- Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning [24.6866990804501]
Instruct-MusicGen is a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions.
Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps.
arXiv Detail & Related papers (2024-05-28T17:27:20Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls [6.176747724853209]
Large Language Models (LLMs) have shown promise in generating high-quality music, but their focus on autoregressive generation limits their utility in music editing tasks.
We propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme.
Our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement.
arXiv Detail & Related papers (2024-02-14T19:00:01Z)
- Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment.
Performance conditioning instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances.
Our prototype is evaluated using uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- Controllable deep melody generation via hierarchical music structure representation [14.891975420982511]
MusicFrameworks is a hierarchical music structure representation and a multi-step generative process to create a full-length melody.
To generate melody in each phrase, we generate rhythm and basic melody using two separate transformer-based networks.
To customize or add variety, one can alter chords, basic melody, and rhythm structure in the music frameworks, letting our networks generate the melody accordingly.
arXiv Detail & Related papers (2021-09-02T01:31:14Z)
- PopMAG: Pop Music Accompaniment Generation [190.09996798215738]
We propose a novel MUlti-track MIDI representation (MuMIDI) which enables simultaneous multi-track generation in a single sequence.
MuMIDI enlarges the sequence length and brings the new challenge of long-term music modeling.
We call our pop music accompaniment generation system PopMAG.
arXiv Detail & Related papers (2020-08-18T02:28:36Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.