Unsupervised Source Separation By Steering Pretrained Music Models
- URL: http://arxiv.org/abs/2110.13071v1
- Date: Mon, 25 Oct 2021 16:08:28 GMT
- Title: Unsupervised Source Separation By Steering Pretrained Music Models
- Authors: Ethan Manilow, Patrick O'Reilly, Prem Seetharaman, Bryan Pardo
- Abstract summary: We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation.
An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio used to generate audio.
This generated audio is fed to a pretrained music tagger that creates source labels.
- Score: 15.847814664948013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We showcase an unsupervised method that repurposes deep models trained for
music generation and music tagging for audio source separation, without any
retraining. An audio generation model is conditioned on an input mixture,
producing a latent encoding of the audio used to generate audio. This generated
audio is fed to a pretrained music tagger that creates source labels. The
cross-entropy loss between the tag distribution for the generated audio and a
predefined distribution for an isolated source is used to guide gradient ascent
in the (unchanging) latent space of the generative model. This system does not
update the weights of the generative model or the tagger, and only relies on
moving through the generative model's latent space to produce separated
sources. We use OpenAI's Jukebox as the pretrained generative model, and we
couple it with four kinds of pretrained music taggers (two architectures and
two tagging datasets). Experimental results on two source separation datasets
show this approach can produce separation estimates for a wider variety of
sources than any tested supervised or unsupervised system. This work points to
the vast and heretofore untapped potential of large pretrained music models for
audio-to-audio tasks like source separation.
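The optimization loop described in the abstract (condition on the mixture, decode audio from the latent code, tag the result, and move the latent code toward a target tag distribution) can be sketched in a few lines. The snippet below is a minimal PyTorch-style illustration, not the authors' released code: the `encode`/`decode` interface, the Adam optimizer, and the soft-target cross-entropy are assumptions standing in for Jukebox and the pretrained taggers, with minimizing the cross-entropy playing the role of the gradient ascent described above.

```python
# Minimal sketch of latent-space "steering" for source separation.
# `generative_model`, `tagger`, and the encode/decode interface are
# hypothetical placeholders, not the actual Jukebox / tagger APIs.
import torch
import torch.nn.functional as F

def steer_latents(generative_model, tagger, mixture, target_tags,
                  steps=200, lr=1e-2):
    """Optimize only the latent code of a frozen generative model so that
    the decoded audio's tag distribution matches `target_tags`
    (a predefined distribution for one isolated source)."""
    # Condition on the mixture: get a latent encoding, then treat it as
    # the only trainable parameter. Model and tagger weights stay frozen.
    z = generative_model.encode(mixture).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        audio = generative_model.decode(z)       # frozen generator
        tag_logits = tagger(audio)               # frozen pretrained tagger
        # Cross-entropy between the tagger's prediction and the predefined
        # source tag distribution; descending this loss moves the latent
        # code toward audio the tagger labels as the desired source.
        loss = F.cross_entropy(tag_logits, target_tags)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return generative_model.decode(z).detach()   # separated source estimate
```

Note that nothing here updates model or tagger weights; the only free variables are the latent codes, which is what makes the approach unsupervised and retraining-free.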
Related papers
- Multi-Source Music Generation with Latent Diffusion [7.832209959041259]
The Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources.
MSLDM employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation.
This approach significantly enhances the total and partial generation of music.
arXiv Detail & Related papers (2024-09-10T03:41:10Z) - Source Separation of Multi-source Raw Music using a Residual Quantized Variational Autoencoder [0.0]
I develop a neural audio model based on the residual quantized variational autoencoder architecture.
The model can separate audio sources, achieving near state-of-the-art results with much less computing power.
arXiv Detail & Related papers (2024-08-12T17:30:17Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of production in the music and film industries.
We hypothesize that focusing on these aspects of audio generation can improve performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Controllable Music Production with Diffusion Models and Guidance
Gradients [3.187381965457262]
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in 44.1kHz stereo audio.
The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips.
arXiv Detail & Related papers (2023-11-01T16:01:01Z) - Multi-Source Diffusion Models for Simultaneous Music Generation and Separation [17.124189082882395]
We train our model on Slakh2100, a standard dataset for musical source separation.
Our method is the first example of a single model that can handle both generation and separation tasks.
arXiv Detail & Related papers (2023-02-04T23:18:36Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - Separate And Diffuse: Using a Pretrained Diffusion Model for Improving
Source Separation [99.19786288094596]
We show how the upper bound can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z) - Zero-shot Audio Source Separation through Query-based Learning from
Weakly-labeled Data [26.058278155958668]
We propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet.
Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training.
The proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training.
arXiv Detail & Related papers (2021-12-15T05:13:43Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method uses an acoustic model trained for automatic speech recognition, together with extracted melody features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)