JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation
- URL: http://arxiv.org/abs/2310.19180v4
- Date: Tue, 17 Dec 2024 04:08:33 GMT
- Title: JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation
- Authors: Yao Yao, Peike Li, Boyu Chen, Alex Wang
- Abstract summary: JEN-1 Composer is designed to efficiently model marginal, conditional, and joint distributions over multi-track music.
We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks.
Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis.
- Score: 18.979064278674276
- Abstract: With rapid advances in generative artificial intelligence, the text-to-music synthesis task has emerged as a promising direction for music generation. Nevertheless, achieving precise control over multi-track generation remains an open challenge. While existing models excel at directly generating multi-track mixes, their limitations become evident when it comes to composing individual tracks and integrating them in a controllable manner. This departure from the typical workflows of professional composers hinders the ability to refine details in specific tracks. To address this gap, we propose JEN-1 Composer, a unified framework designed to efficiently model marginal, conditional, and joint distributions over multi-track music using a single model. Building upon an audio latent diffusion model, JEN-1 Composer extends the versatility of multi-track music generation. We introduce a progressive curriculum training strategy, which gradually escalates the difficulty of training tasks while ensuring the model's generalization ability and facilitating smooth transitions between different scenarios. During inference, users can iteratively generate and select music tracks, thus incrementally composing entire musical pieces in accordance with the Human-AI co-composition workflow. Our approach demonstrates state-of-the-art performance in controllable and high-fidelity multi-track music synthesis, marking a significant advancement in interactive AI-assisted music creation. Our demo pages are available at www.jenmusic.ai/research.
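The abstract describes a single model that covers marginal, conditional, and joint distributions over tracks. The following is a minimal, hypothetical sketch of that masking idea: a per-example mask decides which track latents stay clean as conditions and which are noised and reconstructed, and a curriculum stage caps how many tracks are hidden at once. The network, shapes, noise schedule, and the `MultiTrackDenoiser` / `training_step` names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

NUM_TRACKS = 4    # e.g. bass, drums, instrument, melody stems (assumed)
LATENT_DIM = 128  # per-track latent channels (assumed)
SEQ_LEN = 256     # latent time steps (assumed)

class MultiTrackDenoiser(nn.Module):
    """Toy stand-in for a shared denoising network: it sees every track's
    latent plus a per-track flag saying whether that track is a condition
    (kept clean) or a target (to be denoised)."""
    def __init__(self):
        super().__init__()
        # one extra channel per track carries the condition/target flag
        self.net = nn.Conv1d(NUM_TRACKS * (LATENT_DIM + 1),
                             NUM_TRACKS * LATENT_DIM,
                             kernel_size=3, padding=1)

    def forward(self, latents, cond_mask):
        # latents:   (B, NUM_TRACKS, LATENT_DIM, SEQ_LEN)
        # cond_mask: (B, NUM_TRACKS), 1 = given as condition, 0 = generate
        b = latents.shape[0]
        flag = cond_mask[:, :, None, None].expand(b, NUM_TRACKS, 1, SEQ_LEN)
        x = torch.cat([latents, flag], dim=2).reshape(b, -1, SEQ_LEN)
        return self.net(x).reshape(b, NUM_TRACKS, LATENT_DIM, SEQ_LEN)

def training_step(model, clean_latents, curriculum_stage):
    """One toy denoising step. The curriculum stage caps how many tracks
    are hidden at once, from single-track marginals/conditionals up to
    the full joint over all tracks."""
    b = clean_latents.shape[0]
    num_targets = torch.randint(1, min(curriculum_stage, NUM_TRACKS) + 1, (1,)).item()
    perm = torch.randperm(NUM_TRACKS)
    cond_mask = torch.ones(b, NUM_TRACKS)
    cond_mask[:, perm[:num_targets]] = 0.0        # these tracks get generated

    t = torch.rand(b).view(b, 1, 1, 1)            # per-example noise level
    noise = torch.randn_like(clean_latents)
    noisy = (1 - t) * clean_latents + t * noise   # simple linear corruption
    gen = (1 - cond_mask)[:, :, None, None]       # 1 where a track is a target
    model_in = clean_latents * (1 - gen) + noisy * gen  # conditions stay clean

    pred = model(model_in, cond_mask)
    loss = ((pred - noise) ** 2 * gen).mean()     # supervise only target tracks
    return loss

model = MultiTrackDenoiser()
latents = torch.randn(2, NUM_TRACKS, LATENT_DIM, SEQ_LEN)
print(training_step(model, latents, curriculum_stage=2))
```

At inference, fixing cond_mask to mark the stems a user has already accepted and regenerating only the remaining ones would mirror the iterative generate-and-select workflow the abstract describes.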
Related papers
- SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation [75.86473375730392]
SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation.
It supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately.
To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline.
arXiv Detail & Related papers (2025-02-18T18:52:21Z) - UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation [7.428668206443388]
We introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic training set.
We demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset and evaluated against the well-known EnsembleSet.
arXiv Detail & Related papers (2024-09-17T08:58:33Z) - Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement [10.714947060480426]
We propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model.
Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines.
arXiv Detail & Related papers (2024-08-27T16:18:51Z) - MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z) - Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls [6.176747724853209]
Large Language Models (LLMs) have shown promise in generating high-quality music, but their focus on autoregressive generation limits their utility in music editing tasks.
We propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme.
Our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement.
arXiv Detail & Related papers (2024-02-14T19:00:01Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
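As a rough illustration of what a delay-style interleaving pattern over parallel codebook streams can look like (the offsets, PAD value, and function name are assumptions for illustration, not MusicGen's actual code):

```python
from typing import List

PAD = -1  # placeholder token for positions with no codebook entry (assumed)

def delay_interleave(streams: List[List[int]]) -> List[List[int]]:
    """Offset codebook stream k by k steps so a single autoregressive LM
    can predict all K streams over T + K - 1 flattened positions.

    streams: K parallel token streams of equal length T.
    Returns a (T + K - 1) x K grid with PAD in the shifted slots.
    """
    k_streams = len(streams)
    t_len = len(streams[0])
    grid = [[PAD] * k_streams for _ in range(t_len + k_streams - 1)]
    for k, stream in enumerate(streams):
        for t, tok in enumerate(stream):
            grid[t + k][k] = tok
    return grid

# Example: 3 codebook streams of length 4
streams = [[10, 11, 12, 13],
           [20, 21, 22, 23],
           [30, 31, 32, 33]]
for row in delay_interleave(streams):
    print(row)  # position 0 -> [10, PAD, PAD], position 1 -> [11, 20, PAD], ...
```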
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations.
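To make the 2D layout concrete, below is a toy tracks-by-time grid in the spirit of the GETScore description above; the track names, tokens, and empty symbol are illustrative assumptions rather than the paper's actual vocabulary.

```python
# Toy tracks-by-time token grid: one row per track, one column per step.
EMPTY = "<empty>"  # assumed placeholder for cells where a track is silent

tracks = ["melody", "bass", "drums"]
num_steps = 8

# score[track_index][time_step] -> token at that cell
score = [[EMPTY] * num_steps for _ in tracks]
score[0][0] = "C4_q"   # melody: quarter-note C4 at step 0
score[1][0] = "C2_h"   # bass: half-note C2 at step 0
score[2][0] = "kick"   # drums: kick at step 0

# Generating a target track given source tracks then amounts to filling
# every cell of the target row while the source rows stay fixed, which a
# non-autoregressive (mask-predict style) model can do for arbitrary
# source-target track combinations.
for name, row in zip(tracks, score):
    print(f"{name:>7}: {row}")
```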
arXiv Detail & Related papers (2023-05-18T09:53:23Z) - Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
arXiv Detail & Related papers (2022-04-01T17:53:39Z) - MMM: Exploring Conditional Multi-Track Music Generation with the Transformer [9.569049935824227]
We propose a generative system based on the Transformer architecture that is capable of generating multi-track music.
We create a time-ordered sequence of musical events for each track and concatenate several tracks into a single sequence.
This takes advantage of the Transformer's attention mechanism, which can adeptly handle long-term dependencies.
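A minimal sketch of the per-track concatenation described above, assuming illustrative start/end marker tokens rather than the paper's exact vocabulary:

```python
from typing import List

def flatten_tracks(tracks: List[List[str]]) -> List[str]:
    """Wrap each track's time-ordered event tokens in start/end markers
    and concatenate them into one flat sequence for a Transformer LM."""
    seq = ["<piece_start>"]
    for events in tracks:
        seq.append("<track_start>")
        seq.extend(events)          # events are already time-ordered per track
        seq.append("<track_end>")
    seq.append("<piece_end>")
    return seq

# Example: two tracks with toy note-on/off style events
piano = ["note_on_60", "dt_4", "note_off_60"]
drums = ["note_on_36", "dt_2", "note_off_36"]
print(flatten_tracks([piano, drums]))
```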
arXiv Detail & Related papers (2020-08-13T02:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.