MusicLDM: Enhancing Novelty in Text-to-Music Generation Using
Beat-Synchronous Mixup Strategies
- URL: http://arxiv.org/abs/2308.01546v1
- Date: Thu, 3 Aug 2023 05:35:37 GMT
- Title: MusicLDM: Enhancing Novelty in Text-to-Music Generation Using
Beat-Synchronous Mixup Strategies
- Authors: Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor
Berg-Kirkpatrick, Shlomo Dubnov
- Abstract summary: We build a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain.
We propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup.
In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music.
- Score: 32.482588500419006
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models have shown promising results in cross-modal generation
tasks, including text-to-image and text-to-audio generation. However,
generating music, as a special type of audio, presents unique challenges due to
limited availability of music data and sensitive issues related to copyright
and plagiarism. In this paper, to tackle these challenges, we first construct a
state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion
and AudioLDM architectures to the music domain. We achieve this by retraining
the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN
vocoder, as components of MusicLDM, on a collection of music data samples.
Then, to address the limitations of training data and to avoid plagiarism, we
leverage a beat tracking model and propose two different mixup strategies for
data augmentation: beat-synchronous audio mixup and beat-synchronous latent
mixup, which recombine training audio directly or via a latent embedding
space, respectively. Such mixup strategies encourage the model to interpolate
between musical training samples and generate new music within the convex hull
of the training data, making the generated music more diverse while still
staying faithful to the corresponding style. In addition to popular evaluation
metrics, we design several new evaluation metrics based on CLAP score to
demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies
improve both the quality and novelty of generated music, as well as the
correspondence between input text and generated music.
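The mixup idea is concrete enough to illustrate. Below is a minimal, hypothetical sketch of beat-synchronous audio mixup, not the authors' code: it substitutes librosa's beat tracker for the dedicated beat tracking model used in the paper, aligns two clips only at their first detected beat (a fuller version would also match tempo), and takes a convex combination with a Beta-sampled coefficient. The latent variant would apply the same combination to latent embeddings rather than waveforms.

```python
# Hypothetical sketch of beat-synchronous audio mixup (not the authors' code).
# The paper uses a dedicated beat-tracking model; librosa's tracker stands in
# here, and only the first beats are aligned.
import numpy as np
import librosa

def beat_synchronous_audio_mixup(y1, y2, sr=16000, alpha=5.0, rng=None):
    """Mix two music waveforms after aligning them on their first detected beat."""
    rng = rng or np.random.default_rng()

    # Track beats in each clip and convert the first beat frame to a sample index.
    _, beats1 = librosa.beat.beat_track(y=y1, sr=sr)
    _, beats2 = librosa.beat.beat_track(y=y2, sr=sr)
    start1 = librosa.frames_to_samples(beats1[0]) if len(beats1) else 0
    start2 = librosa.frames_to_samples(beats2[0]) if len(beats2) else 0

    # Align both clips at their first beat and trim to a common length.
    a, b = y1[start1:], y2[start2:]
    n = min(len(a), len(b))

    # Convex combination with a Beta-sampled coefficient, as in standard mixup.
    lam = rng.beta(alpha, alpha)
    return lam * a[:n] + (1.0 - lam) * b[:n], lam
```

The CLAP-based metrics can be sketched in the same spirit. The snippet below assumes text and audio embeddings from a CLAP model have already been computed, and shows two plausible readings of the metrics: text-audio correspondence as cosine similarity, and a novelty proxy as similarity to the nearest training clip. The paper's exact definitions may differ.

```python
# Hypothetical CLAP-score metrics (illustrative; not the paper's exact definitions).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def text_audio_clap_score(text_emb, audio_emb):
    """Text-audio correspondence: higher means the audio matches the prompt better."""
    return cosine(text_emb, audio_emb)

def nearest_train_similarity(gen_emb, train_embs):
    """Novelty proxy: similarity to the closest training clip (lower = more novel)."""
    sims = (train_embs @ gen_emb) / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(gen_emb) + 1e-8
    )
    return float(sims.max())
```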
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
- Text Conditioned Symbolic Drumbeat Generation using Latent Diffusion Models [0.0]
This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs).
By pretraining a text and drumbeat encoder through contrastive learning within a multimodal network, we align the modalities of text and music closely.
We show that the generated drumbeats are novel and apt to the prompt text, and comparable in quality to those created by human musicians.
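A minimal sketch of the CLIP-style contrastive objective described above, assuming paired batches of text and audio embeddings (illustrative only, not the paper's implementation):

```python
# Hypothetical sketch of a symmetric InfoNCE loss for aligning text and
# drumbeat/audio embeddings (illustrative; not the paper's implementation).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (text, audio) embeddings."""
    # Normalize so dot products are cosine similarities.
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = t @ a.T / temperature
    targets = torch.arange(len(t), device=t.device)

    # Cross-entropy in both directions (text-to-audio and audio-to-text).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```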
arXiv Detail & Related papers (2024-08-05T13:23:05Z)
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation [49.89372182441713]
We introduce LARP, a multi-modal cold-start playlist continuation model.
Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss.
arXiv Detail & Related papers (2024-06-20T14:02:15Z)
- QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation [46.301388755267986]
We propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy.
We first adapt and implement a masked diffusion transformer (MDT) model for the text-to-music (TTM) task, demonstrating its capacity for quality control and enhanced musicality.
Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer dataset.
arXiv Detail & Related papers (2024-05-24T18:09:27Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can generate music to match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns; one such delay-style interleaving is sketched below.
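A toy sketch of a delay pattern over K codebook streams, in which codebook k is shifted by k steps so all codebooks can be predicted in parallel at each LM step (hypothetical illustration; the pad token id is an assumption):

```python
# Hypothetical sketch of a delay-style token interleaving (illustrative only).
import numpy as np

def delay_interleave(codes, pad_token=-1):
    """codes: (K, T) array of discrete tokens from K codebooks.

    Returns a (K, T + K - 1) array in which row k is delayed by k positions,
    padded with pad_token, so one LM step can predict all K rows in parallel.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```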
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes them in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations.
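A toy version of such a 2D token grid, with assumed token ids and an assumed "empty" token (hypothetical; not the GETScore representation verbatim):

```python
# Hypothetical sketch of a GETScore-like 2D token grid (illustrative only):
# tracks stacked vertically, time advancing horizontally, with an assumed
# EMPTY token filling positions where a track has no event.
import numpy as np

EMPTY = 0  # assumed id of the "no event" token

def build_score_grid(track_events, num_steps):
    """track_events: one dict per track, mapping time step -> token id."""
    grid = np.full((len(track_events), num_steps), EMPTY, dtype=np.int64)
    for row, events in enumerate(track_events):
        for step, token in events.items():
            grid[row, step] = token
    return grid

# Example: a melody track and a drum track over 8 time steps.
grid = build_score_grid([{0: 60, 2: 62, 4: 64}, {0: 36, 4: 36}], num_steps=8)
```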
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.