Efficient Neural Music Generation
- URL: http://arxiv.org/abs/2305.15719v1
- Date: Thu, 25 May 2023 05:02:35 GMT
- Title: Efficient Neural Music Generation
- Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu,
Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan
Wang
- Abstract summary: We present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality.
MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform.
- Score: 42.39082326446739
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent progress in music generation has been driven largely by the
state-of-the-art MusicLM, which comprises a hierarchy of three LMs for
semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet
sampling with MusicLM requires running these LMs one after another to
obtain the fine-grained acoustic tokens, making it computationally expensive
and prohibitive for real-time generation. Efficient music generation with
quality on par with MusicLM remains a significant challenge. In this paper, we
present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion
model that generates music audio of state-of-the-art quality while reducing
the forward passes in MusicLM by 95.7% or 99.6% when sampling 10s or 30s of
music, respectively. MeLoDy inherits the highest-level LM from MusicLM for
semantic modeling, and applies a novel dual-path diffusion (DPD) model and an
audio VAE-GAN to efficiently decode the conditioning semantic tokens into a
waveform. DPD is proposed to simultaneously model the coarse and fine acoustics
by incorporating the semantic information into segments of latents
via cross-attention at each denoising step. Our experimental results suggest
the superiority of MeLoDy, not only in its practical advantages in sampling
speed and infinitely continuable generation, but also in its state-of-the-art
musicality, audio quality, and text correlation.
Our samples are available at https://Efficient-MeLoDy.github.io/.
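The abstract states that DPD injects semantic information into segments of latents via cross-attention at each denoising step. The following is a minimal sketch of that general idea only, not the authors' DPD network: it uses plain scaled dot-product attention without learned projections, and the names `denoise_step`, `segment_len`, and the residual update are hypothetical placeholders chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    queries: one vector per latent position being denoised.
    keys/values: one vector per conditioning semantic token.
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs


def denoise_step(latents, semantic_tokens, segment_len):
    """Hypothetical update: each segment of latents attends to the tokens.

    The residual addition is a stand-in for a real denoising network.
    """
    updated = []
    for start in range(0, len(latents), segment_len):
        segment = latents[start:start + segment_len]
        context = cross_attention(segment, semantic_tokens, semantic_tokens)
        for x, c in zip(segment, context):
            updated.append([xi + ci for xi, ci in zip(x, c)])
    return updated


random.seed(0)
DIM = 4
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
semantic_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

for _ in range(2):  # two toy denoising steps, conditioning applied at each
    latents = denoise_step(latents, semantic_tokens, segment_len=3)
```

The point of the sketch is the data flow: the same semantic tokens serve as keys and values at every step, while the latents, processed segment by segment, supply the queries.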
Related papers
- Multi-Source Music Generation with Latent Diffusion [7.832209959041259]
The Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources.
MSLDM employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation.
This approach significantly enhances the total and partial generation of music.
arXiv Detail & Related papers (2024-09-10T03:41:10Z)
- Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models [0.0]
"Diff-A-Riff" is a Latent Diffusion Model designed to generate high-quality instrumentals adaptable to any musical context.
It produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.
arXiv Detail & Related papers (2024-06-12T16:34:26Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation [46.301388755267986]
We propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy.
We first adapted and implemented a masked diffusion transformer (MDT) model for the TTM task, demonstrating its capacity for quality control and enhanced musicality.
Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer dataset.
arXiv Detail & Related papers (2024-05-24T18:09:27Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens.
We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation.
We shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling.
arXiv Detail & Related papers (2024-01-09T14:29:39Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription [8.669338893753885]
This paper makes several contributions to automatic lyrics transcription (ALT) research.
Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net.
We present a new test set with a considerably larger size and a higher musical variability compared to the existing datasets used in ALT.
arXiv Detail & Related papers (2021-08-05T13:59:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.