Controllable Music Production with Diffusion Models and Guidance Gradients
- URL: http://arxiv.org/abs/2311.00613v2
- Date: Tue, 5 Dec 2023 10:32:03 GMT
- Title: Controllable Music Production with Diffusion Models and Guidance Gradients
- Authors: Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, Tom Nickson
- Abstract summary: We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in 44.1kHz stereo audio.
The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips.
- Score: 3.187381965457262
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We demonstrate how conditional generation from diffusion models can be used
to tackle a variety of realistic tasks in the production of music in 44.1kHz
stereo audio with sampling-time guidance. The scenarios we consider include
continuation, inpainting and regeneration of musical audio, the creation of
smooth transitions between two different music tracks, and the transfer of
desired stylistic characteristics to existing audio clips. We achieve this by
applying guidance at sampling time in a simple framework that supports both
reconstruction and classification losses, or any combination of the two. This
approach ensures that generated audio can match its surrounding context, or
conform to a class distribution or latent representation specified relative to
any suitable pre-trained classifier or embedding model. Audio samples are
available at https://machinelearning.apple.com/research/controllable-music
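To make the sampling-time guidance described in the abstract concrete, the sketch below (PyTorch) shows one plausible way to combine a reconstruction loss against known context audio with a classifier/embedding loss inside a diffusion sampling loop. This is a minimal sketch under assumptions: the denoiser signature, the Euler-style update, the cosine embedding loss, and the weights w_rec / w_cls are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def guided_sample(denoiser, sigmas, context, context_mask,
                  embed_model, target_embedding, w_rec=1.0, w_cls=1.0):
    """Euler-style sampling loop with guidance gradients (illustrative only).

    Assumes denoiser(x, sigma) returns an estimate of the clean audio x0,
    context/context_mask mark the samples that must match existing audio,
    and embed_model maps audio into the space of target_embedding.
    """
    x = torch.randn_like(context) * sigmas[0]            # start from pure noise
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, sigma)                       # predicted clean audio

        # Reconstruction loss: generated audio should agree with its context.
        rec_loss = (((x0_hat - context) ** 2) * context_mask).mean()
        # Classification/embedding loss: match a target latent representation.
        cls_loss = (1.0 - F.cosine_similarity(
            embed_model(x0_hat), target_embedding, dim=-1)).mean()

        loss = w_rec * rec_loss + w_cls * cls_loss
        grad = torch.autograd.grad(loss, x)[0]

        # One plain Euler denoising step, then a guidance step along -grad.
        d = (x - x0_hat) / sigma
        x = x.detach() + d * (sigma_next - sigma) - (sigma ** 2) * grad
    return x.detach()
```

Setting w_cls to zero recovers pure context-matching (e.g. inpainting or continuation), while setting w_rec to zero steers generation toward a class or embedding target alone; any mixture of the two losses can be used, as the abstract describes.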
Related papers
- MusicFlow: Cascaded Flow Matching for Text Guided Music Generation [53.63948108922333]
MusicFlow is a cascaded text-to-music generation model based on flow matching.
We leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation.
arXiv Detail & Related papers (2024-10-27T15:35:41Z)
- Combining audio control and style transfer using latent diffusion [1.705371629600151]
In this paper, we aim to unify explicit control and style transfer within a single model.
Our model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example.
We show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
arXiv Detail & Related papers (2024-07-31T23:27:27Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z)
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance [6.531546527140474]
We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE).
We demonstrate how the model is able to apply fine-grained style morphing over the course of the audio.
arXiv Detail & Related papers (2020-06-16T12:54:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.