Multi-Source Music Generation with Latent Diffusion
- URL: http://arxiv.org/abs/2409.06190v3
- Date: Tue, 15 Oct 2024 19:17:33 GMT
- Title: Multi-Source Music Generation with Latent Diffusion
- Authors: Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
- Abstract summary: The Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources.
The proposed Multi-Source Latent Diffusion Model (MSLDM) instead employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation.
This approach significantly enhances the total and partial generation of music.
- Score: 7.832209959041259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use one single diffusion model to generate mutually coherent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a Multi-Source Latent Diffusion Model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated and our diffusion model learns this joint latent space. This approach significantly enhances the total and partial generation of music by leveraging the VAE's latent compression and noise-robustness. The compressed source latent also facilitates more efficient generation. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than direct music mixture modeling. Codes and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
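To make the described architecture concrete, below is a minimal, hypothetical PyTorch-style sketch of the idea: a shared VAE encodes each instrumental source waveform into a "source latent," the K source latents are concatenated along the channel axis, and a diffusion model is trained to denoise this joint latent. All names (SourceVAE, joint_latent, diffusion_training_step), shapes, and the toy noising schedule are illustrative placeholders, not the authors' implementation; see the linked repository for the real code.
```python
import torch
import torch.nn as nn

# Hypothetical sketch of the MSLDM idea (not the authors' code; see the repo linked above).
# K instrumental sources -> a shared VAE encodes each waveform into a "source latent";
# the K latents are concatenated and a diffusion model is trained on the joint latent.

class SourceVAE(nn.Module):
    """Toy stand-in for the shared waveform VAE (real encoder/decoder omitted)."""
    def __init__(self, latent_channels: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(1, 2 * latent_channels, kernel_size=1024, stride=512)
        self.decoder = nn.ConvTranspose1d(latent_channels, 1, kernel_size=1024, stride=512)

    def encode(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (B, 1, T_samples) -> mu, logvar: (B, C, T_latent)
        mu, logvar = self.encoder(wav).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


def joint_latent(vae: SourceVAE, sources: list[torch.Tensor]) -> torch.Tensor:
    """Encode each source separately, then concatenate along the channel axis."""
    return torch.cat([vae.encode(s) for s in sources], dim=1)  # (B, K*C, T_latent)


def diffusion_training_step(denoiser: nn.Module, z0: torch.Tensor) -> torch.Tensor:
    """One simplified denoising step on the joint source latent (toy noise schedule)."""
    sigma = torch.rand(z0.shape[0], device=z0.device).view(-1, 1, 1)  # noise level per example
    noise = torch.randn_like(z0)
    zt = z0 + sigma * noise
    pred_noise = denoiser(zt, sigma.flatten())  # denoiser: any (z_t, sigma) -> noise network
    return ((pred_noise - noise) ** 2).mean()
```
At sampling time the generated joint latent is split back into per-source latents, each is decoded by the VAE, and the decoded waveforms are summed into the mixture; partial generation then amounts to conditioning on the latents of the given sources while sampling only the remaining ones.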
Related papers
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation workflows in the music and film industry.
Our hypothesis is that focusing on these aspects of audio generation can improve generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
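As a rough illustration of how such (prompt, winner, losers) tuples can be turned into a training signal for a diffusion model, here is a DPO-style preference loss on denoising errors, in the spirit of Diffusion-DPO; this is a sketch under that assumption, not necessarily Tango 2's exact objective, and the error tensors and beta value are placeholders.
```python
import torch.nn.functional as F

# Illustrative DPO-style preference loss for a diffusion model (assumed formulation,
# in the spirit of Diffusion-DPO). Each err_* tensor holds per-example denoising errors
# ||eps - eps_theta(x_t, t, prompt)||^2 for the winner (w) or a loser (l) audio, under
# the trained model (err_*) and a frozen reference model (ref_err_*).

def dpo_diffusion_loss(err_w, err_l, ref_err_w, ref_err_l, beta: float = 2000.0):
    # Reward the model for denoising the preferred sample better than the reference
    # does, relative to how much better it denoises the dispreferred sample.
    margin = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * margin).mean()
```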
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models [26.373204974010086]
Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks.
This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings.
We propose an inference procedure enabling the coherent generation of sources and accompaniments.
arXiv Detail & Related papers (2024-03-18T12:08:01Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
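To make "token interleaving patterns" concrete, here is a toy sketch of one such pattern, a delay layout in which codebook stream k is shifted by k steps so that a single-stage LM can predict all streams frame by frame; this is a simplified illustration rather than MusicGen's exact scheme, and the PAD id and helper names are placeholders.
```python
import torch

# Toy illustration of a "delay" interleaving pattern over K parallel codebook streams
# (simplified; not MusicGen's exact implementation). Stream k is shifted right by k
# steps so a single-stage LM can emit all K tokens per step while each codebook still
# conditions on the coarser ones.

PAD = -1  # placeholder id for positions with no token yet

def delay_interleave(codes: torch.Tensor) -> torch.Tensor:
    """codes: (K, T) integer token ids -> (K, T + K - 1) delayed layout."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def delay_deinterleave(delayed: torch.Tensor) -> torch.Tensor:
    """Inverse of delay_interleave."""
    K, L = delayed.shape
    T = L - K + 1
    return torch.stack([delayed[k, k:k + T] for k in range(K)])

# Example: 4 codebooks, 6 frames; the round trip recovers the original layout.
codes = torch.arange(24).reshape(4, 6)
assert torch.equal(delay_deinterleave(delay_interleave(codes)), codes)
```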
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- Efficient Neural Music Generation [42.39082326446739]
We present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality.
MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform.
arXiv Detail & Related papers (2023-05-25T05:02:35Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio is able to faithfully reflect key elements of the text prompt, such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
arXiv Detail & Related papers (2023-02-08T07:27:27Z)
- Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion [27.567536688166776]
We bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure.
Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions.
arXiv Detail & Related papers (2023-01-27T14:52:53Z)
- Msanii: High Fidelity Music Synthesis on a Shoestring Budget [0.0]
We present Msanii, a novel diffusion-based model for synthesizing high-fidelity music efficiently.
Our model combines the synthesis of mel spectrograms, the generative capabilities of diffusion models, and the vocoding capabilities of neural vocoders.
arXiv Detail & Related papers (2023-01-16T15:18:26Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Unsupervised Source Separation By Steering Pretrained Music Models [15.847814664948013]
We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation.
An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio that is then used to generate audio.
This generated audio is fed to a pretrained music tagger that creates source labels.
arXiv Detail & Related papers (2021-10-25T16:08:28Z)