Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models
- URL: http://arxiv.org/abs/2406.08384v2
- Date: Wed, 30 Oct 2024 14:33:27 GMT
- Title: Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models
- Authors: Javier Nistal, Marco Pasini, Cyran Aouameur, Maarten Grachten, Stefan Lattner,
- Abstract summary: "Diff-A-Riff" is a Latent Diffusion Model designed to generate high-quality instrumentals adaptable to any musical context.
It produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.
- Score: 0.0
- License:
- Abstract: Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce "Diff-A-Riff," a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model's capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website: sonycslparis.github.io/diffariff-companion/
Related papers
- Music Consistency Models [31.415900049111023]
We present Music Consistency Models (textttMusicCM), which leverages the concept of consistency models to efficiently synthesize mel-spectrogram for music clips.
Building upon existing text-to-music diffusion models, the textttMusicCM model incorporates consistency distillation and adversarial discriminator training.
Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness.
arXiv Detail & Related papers (2024-04-20T11:52:30Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - StemGen: A music generation model that listens [9.489938613869864]
We present an alternative paradigm for producing music generation models that can listen and respond to musical context.
We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture.
The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.
arXiv Detail & Related papers (2023-12-14T08:09:20Z) - Controllable Music Production with Diffusion Models and Guidance
Gradients [3.187381965457262]
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in 44.1kHz stereo audio.
The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips.
arXiv Detail & Related papers (2023-11-01T16:01:01Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio
Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using neural audio and latent diffusion models.
In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, to train the model.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z) - Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z) - MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z) - Discrete Diffusion Probabilistic Models for Symbolic Music Generation [6.617487928813374]
This work presents the direct generation of Polyphonic Symbolic Music using D3PMs.
Our model exhibits state-of-the-art sample quality, according to current quantitative evaluation metrics.
We also cast a critical view on quantitative evaluation of music sample quality via statistical metrics, and present a simple algorithm.
arXiv Detail & Related papers (2023-05-16T14:43:38Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven
Portraits Animation [78.08004432704826]
We model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk)
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2023-01-10T05:11:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.