JEN-1: Text-Guided Universal Music Generation with Omnidirectional
Diffusion Models
- URL: http://arxiv.org/abs/2308.04729v1
- Date: Wed, 9 Aug 2023 06:27:24 GMT
- Title: JEN-1: Text-Guided Universal Music Generation with Omnidirectional
Diffusion Models
- Authors: Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, Alex Wang
- Abstract summary: This paper introduces JEN-1, a universal high-fidelity model for text-to-music generation.
JEN-1 is a diffusion model incorporating both autoregressive and non-autoregressive training.
Evaluations demonstrate JEN-1's superior performance over state-of-the-art methods in text-music alignment and music quality.
- Score: 16.18987351077676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Music generation has attracted growing interest with the advancement of deep
generative models. However, generating music conditioned on textual
descriptions, known as text-to-music, remains challenging due to the complexity
of musical structures and high sampling rate requirements. Despite the task's
significance, prevailing generative models exhibit limitations in music
quality, computational efficiency, and generalization. This paper introduces
JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a
diffusion model incorporating both autoregressive and non-autoregressive
training. Through in-context learning, JEN-1 performs various generation tasks
including text-guided music generation, music inpainting, and continuation.
Evaluations demonstrate JEN-1's superior performance over state-of-the-art
methods in text-music alignment and music quality while maintaining
computational efficiency. Our demos are available at
http://futureverse.com/research/jen/demos/jen1
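The abstract describes one diffusion model covering generation, inpainting, and continuation via in-context learning, but gives no implementation details. Below is a minimal sketch of how such multi-task training could be set up, with a per-example task mask marking which frames are given as clean context; all names, shapes, and the noise schedule are illustrative assumptions, not the authors' code.
```python
# Minimal sketch (not the JEN-1 code): one denoiser trained on several
# tasks by sampling a per-example task mask. 1 = frame given as context,
# 0 = frame to be generated. Shapes and schedule are illustrative.
import torch

def sample_task_mask(batch: int, frames: int) -> torch.Tensor:
    masks = []
    for _ in range(batch):
        task = torch.randint(0, 3, (1,)).item()
        m = torch.zeros(frames)
        if task == 1:                      # inpainting: hide the middle
            m[:] = 1.0
            m[frames // 4 : 3 * frames // 4] = 0.0
        elif task == 2:                    # continuation: keep a prefix
            m[: frames // 2] = 1.0
        # task == 0: generation from scratch, mask stays all-zero
        masks.append(m)
    return torch.stack(masks)              # (batch, frames)

def multitask_diffusion_loss(denoiser, x0, text_emb):
    """x0: clean latents (batch, channels, frames); denoiser is any
    network taking (x_in, mask, noise_level, text_emb)."""
    b, c, t = x0.shape
    mask = sample_task_mask(b, t).unsqueeze(1)           # (b, 1, t)
    noise = torch.randn_like(x0)
    a = torch.rand(b, 1, 1)                              # crude noise level
    x_noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise
    x_in = mask * x0 + (1 - mask) * x_noisy              # context stays clean
    pred = denoiser(x_in, mask, a, text_emb)
    return ((pred - noise) ** 2 * (1 - mask)).mean()     # loss on hidden frames
```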
Related papers
- YuE: Scaling Open Foundation Models for Long-Form Music Generation [134.54174498094565]
YuE is a family of open foundation models based on the LLaMA2 architecture.
It generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment.
arXiv Detail & Related papers (2025-03-11T17:26:50Z)
- InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation [43.690876909464336]
We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation.
The unified framework generates high-fidelity music, songs, and audio by coupling an autoregressive transformer with a super-resolution flow-matching model.
Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information.
arXiv Detail & Related papers (2025-02-28T09:58:25Z)
- MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation; a minimal sketch of this objective follows the entry.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
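As a concrete illustration of the masked-prediction objective described in the entry above, here is a minimal flow-matching loss in which the model sees a partially masked target as context and regresses the flow velocity only on the hidden frames. The linear noise-to-data path and all names are illustrative assumptions, not the MusicFlow implementation.
```python
# Illustrative masked flow-matching loss (not the MusicFlow code).
import torch

def masked_flow_matching_loss(vector_field, x1, text_emb, mask_ratio=0.7):
    """x1: target music features (batch, channels, frames); vector_field
    is any network taking (x_tau, tau, context, text_emb)."""
    b, c, t = x1.shape
    keep = (torch.rand(b, 1, t) > mask_ratio).float()  # 1 = visible context
    x0 = torch.randn_like(x1)                          # noise endpoint
    tau = torch.rand(b, 1, 1)                          # flow time in [0, 1]
    x_tau = (1 - tau) * x0 + tau * x1                  # linear path
    target_v = x1 - x0                                 # its constant velocity
    pred_v = vector_field(x_tau, tau, keep * x1, text_emb)
    # masked prediction: train only where the context was hidden
    return ((pred_v - target_v) ** 2 * (1 - keep)).mean()
```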
- Symbolic Music Generation with Fine-grained Interactive Textural Guidance [13.052085651071135]
We introduce Fine-grained Textural Guidance (FTG) within diffusion models to correct errors in the learned distributions.
We derive theoretical characterizations for both the challenges in symbolic music generation and the effect of the FTG approach.
We provide a demo page for interactive music generation with user input to showcase the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-11T00:41:46Z)
- JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation [20.733264277770154]
JEN-1 Composer is a unified framework to efficiently model marginal, conditional, and joint distributions over multi-track music.
We introduce a curriculum training strategy that incrementally moves the model from single-track generation to the flexible generation of multi-track combinations (a toy schedule is sketched after this entry).
We demonstrate state-of-the-art performance in controllable and high-fidelity multi-track music synthesis.
arXiv Detail & Related papers (2023-10-29T22:51:49Z)
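The curriculum mentioned above could be as simple as growing the number of tracks sampled for joint modeling as training progresses; the toy schedule below illustrates the idea. Track names and the linear schedule are hypothetical, not taken from JEN-1 Composer.
```python
# Toy curriculum over track combinations (hypothetical, for illustration).
import random

TRACKS = ["bass", "drums", "instrument", "melody"]   # assumed track names

def sample_track_subset(step: int, total_steps: int) -> list[str]:
    """Grow the maximum number of jointly generated tracks with progress."""
    progress = step / total_steps
    max_tracks = 1 + int(progress * (len(TRACKS) - 1))   # 1 .. len(TRACKS)
    return random.sample(TRACKS, random.randint(1, max_tracks))

print(sample_track_subset(step=1_000, total_steps=10_000))   # mostly 1 track
print(sample_track_subset(step=9_500, total_steps=10_000))   # up to full mix
```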
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representations, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token-interleaving patterns, one of which is sketched below.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
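The MusicGen paper discusses several codebook-interleaving patterns; the sketch below implements one of them, a "delay" pattern in which codebook k is shifted right by k steps so a single-stage LM can predict all codebooks across successive steps. The padding value and list-of-lists layout are assumptions for illustration.
```python
# Sketch of a "delay" codebook-interleaving pattern (illustrative layout).
PAD = -1  # marks positions with no token

def delay_interleave(codes: list[list[int]]) -> list[list[int]]:
    """codes[k][t] = token of codebook k at frame t."""
    k_books, t_frames = len(codes), len(codes[0])
    out = [[PAD] * (t_frames + k_books - 1) for _ in range(k_books)]
    for k in range(k_books):
        for t in range(t_frames):
            out[k][t + k] = codes[k][t]   # shift codebook k by k steps
    return out

for row in delay_interleave([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]):
    print(row)
# [1, 2, 3, 4, -1, -1]
# [-1, 5, 6, 7, 8, -1]
# [-1, -1, 9, 10, 11, 12]
```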
- ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models [67.66825818489406]
This paper introduces a text-to-waveform music generation model built on diffusion models.
Our method uses free-form textual prompts as conditions to guide the waveform generation process.
We demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
arXiv Detail & Related papers (2023-02-09T06:27:09Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role: they are used to generate paired text for the audio in the training set and to extract embeddings of the text prompts ingested by the diffusion models (see the sketch below).
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
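A minimal sketch of the conditioning route the entry above describes: frozen language-model embeddings of the prompt enter the denoiser through cross-attention. The dimensions, the residual wiring, and the omission of timestep conditioning are simplifications assumed here, not the Noise2Music architecture.
```python
# Simplified text-conditioned denoiser block (not the Noise2Music model).
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, heads=8):
        super().__init__()
        self.proj = nn.Linear(text_dim, audio_dim)   # match embedding widths
        self.attn = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        self.out = nn.Linear(audio_dim, audio_dim)

    def forward(self, x_noisy, text_emb):
        """x_noisy: (batch, frames, audio_dim) noisy audio features;
        text_emb: (batch, tokens, text_dim) frozen LM embeddings.
        Timestep conditioning is omitted to keep the sketch short."""
        kv = self.proj(text_emb)
        attended, _ = self.attn(query=x_noisy, key=kv, value=kv)
        return self.out(x_noisy + attended)          # residual cross-attention

model = TextConditionedDenoiser()
noise_pred = model(torch.randn(2, 100, 128), torch.randn(2, 16, 768))
```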
- Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation [138.74751744348274]
We propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation.
Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures.
With the coarse-grained attention, a token attends only to a summarization of the other bars rather than to each of their tokens, reducing the computational cost; a toy mask construction follows the entry.
arXiv Detail & Related papers (2022-10-19T07:31:56Z)
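A toy construction of such a fine-/coarse-grained attention mask: tokens attend token-by-token within their own bar and to structure-relevant bars, but only to a single summary token for every other bar. The bar layout, relevance sets, and dense loops are illustrative assumptions, not the released Museformer code.
```python
# Toy fine-/coarse-grained attention mask (illustrative, dense version).
import numpy as np

def attention_mask(bar_of_token, summary_token_of_bar, relevant):
    """bar_of_token[i]: bar index of token i (summary tokens included);
    summary_token_of_bar[b]: index of bar b's summary token;
    relevant[b]: set of bars that bar b attends to at fine granularity."""
    n = len(bar_of_token)
    allow = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            bi, bj = bar_of_token[i], bar_of_token[j]
            if bj == bi or bj in relevant[bi]:
                allow[i, j] = True               # fine: every token of the bar
            elif j == summary_token_of_bar[bj]:
                allow[i, j] = True               # coarse: summary token only
    return allow

# 2 bars with 2 tokens each; tokens 2 and 5 are the bars' summary tokens.
mask = attention_mask([0, 0, 0, 1, 1, 1], [2, 5], {0: set(), 1: {0}})
print(mask.astype(int))
```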
- A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions [10.179835761549471]
This paper provides an overview of various composition tasks at different levels of music generation using deep learning.
In addition, we summarize datasets suitable for diverse tasks, discuss music representations, evaluation methods, and the challenges at each level, and point out several future directions.
arXiv Detail & Related papers (2020-11-13T08:01:20Z)
- SongNet: Rigid Formats Controlled Text Generation [51.428634666559724]
We propose a simple and elegant framework named SongNet to tackle rigid-format controlled text generation.
The backbone of the framework is a Transformer-based auto-regressive language model.
A pre-training and fine-tuning framework is designed to further improve the generation quality.
arXiv Detail & Related papers (2020-04-17T01:40:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.