Zero-shot Audio Source Separation through Query-based Learning from
Weakly-labeled Data
- URL: http://arxiv.org/abs/2112.07891v2
- Date: Thu, 16 Dec 2021 09:06:57 GMT
- Title: Zero-shot Audio Source Separation through Query-based Learning from
Weakly-labeled Data
- Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick,
Shlomo Dubnov
- Abstract summary: We propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet.
Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training.
The proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training.
- Score: 26.058278155958668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning techniques for separating audio into different sound sources
face several challenges. Standard architectures require training separate
models for different types of audio sources. Although some universal separators
employ a single model to target multiple sources, they have difficulty
generalizing to unseen sources. In this paper, we propose a three-component
pipeline to train a universal audio source separator from a large, but
weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound
event detection system for processing weakly-labeled training data. Second, we
devise a query-based audio separation model that leverages this data for model
training. Third, we design a latent embedding processor to encode queries that
specify audio targets for separation, allowing for zero-shot generalization.
Our approach uses a single model for source separation of multiple sound types,
and relies solely on weakly-labeled data for training. In addition, the
proposed audio separator can be used in a zero-shot setting, learning to
separate types of audio sources that were never seen in training. To evaluate
the separation performance, we test our model on MUSDB18, while training on the
disjoint AudioSet. We further verify the zero-shot performance by conducting
another experiment on audio source types that are held-out from training. The
model achieves comparable Source-to-Distortion Ratio (SDR) performance to
current supervised models in both cases.
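As a rough illustration of the query-based component described above, the sketch below shows how a query embedding (for example, one produced by the latent embedding processor from a sound-event detector's output) could condition a spectrogram-mask separator. This is a minimal PyTorch sketch under assumed names and dimensions (QueryConditionedSeparator, FiLM-style conditioning, 513 frequency bins, 128-d queries); it is not the authors' released architecture.

```python
# Minimal sketch of query-conditioned separation (illustrative, not the paper's code).
# A query embedding modulates a spectrogram-mask network via feature-wise conditioning.
import torch
import torch.nn as nn


class QueryConditionedSeparator(nn.Module):
    def __init__(self, n_freq: int = 513, query_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # FiLM-style conditioning: the query produces a scale and shift per hidden unit.
        self.film = nn.Linear(query_dim, 2 * hidden)
        self.decoder = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, n_freq) magnitude spectrogram of the mixture
        # query:   (batch, query_dim) embedding describing the target source
        h = self.encoder(mix_mag)
        scale, shift = self.film(query).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)
        mask = self.decoder(h)          # (batch, frames, n_freq) mask in [0, 1]
        return mask * mix_mag           # estimated target magnitude


if __name__ == "__main__":
    model = QueryConditionedSeparator()
    mixture = torch.rand(2, 100, 513)   # dummy mixture spectrogram
    query = torch.randn(2, 128)         # dummy query embedding
    target_est = model(mixture, query)
    print(target_est.shape)             # torch.Size([2, 100, 513])
```

In the zero-shot setting described in the abstract, the trained separator stays fixed and only the query embedding changes, e.g. one computed for a source type that never appeared in training.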
Related papers
- Universal Sound Separation with Self-Supervised Audio Masked Autoencoder [35.560261097213846]
We propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system.
The proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
arXiv Detail & Related papers (2024-07-16T14:11:44Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting (a brief illustrative sketch of this text-query idea follows the list below).
arXiv Detail & Related papers (2022-12-14T07:21:45Z)
- Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS).
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network that is learned to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
- Unsupervised Audio Source Separation Using Differentiable Parametric Source Models [8.80867379881193]
We propose an unsupervised model-based deep learning approach to musical source separation.
A neural network is trained to reconstruct the observed mixture as a sum of the sources.
The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods.
arXiv Detail & Related papers (2022-01-24T11:05:30Z)
- Unsupervised Source Separation By Steering Pretrained Music Models [15.847814664948013]
We showcase an unsupervised method that repurposes deep models trained for music generation and music tagging for audio source separation.
An audio generation model is conditioned on an input mixture, producing a latent encoding of the audio that is then used to generate audio.
This generated audio is fed to a pretrained music tagger that creates source labels.
arXiv Detail & Related papers (2021-10-25T16:08:28Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method [67.24600975813419]
We propose a convolution layer capable of handling arbitrary sampling frequencies by a single deep neural network.
We show that the introduction of the proposed layer enables a conventional audio source separation model to consistently work with even unseen sampling frequencies.
arXiv Detail & Related papers (2021-05-10T02:33:42Z)
- Leveraging Category Information for Single-Frame Visual Sound Source Separation [15.26733033527393]
We study simple yet efficient models for visual sound separation using only a single video frame.
Our models are able to exploit the information of the sound source category in the separation process.
arXiv Detail & Related papers (2020-07-15T20:35:29Z)
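Several of the related papers above (CLIPSep, AudioSep, LASS-Net) query the separator with natural language rather than an audio-derived embedding. The sketch below, referenced from the CLIPSep entry, shows one hedged way to wire a text query into the separator sketched earlier: the openai/CLIP calls follow that library's public interface, while the linear adapter and the 128-d query size are illustrative assumptions standing in for a projection that a real system would learn jointly with the separator.

```python
# Illustrative sketch of text-queried zero-shot separation in the spirit of CLIPSep/AudioSep:
# a frozen CLIP text encoder supplies the query embedding for a separator such as the
# QueryConditionedSeparator sketched earlier. The adapter layer is a hypothetical stand-in.
import torch
import torch.nn as nn
import clip  # openai/CLIP, https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)   # frozen text/image encoder
adapter = nn.Linear(512, 128).to(device)               # CLIP text dim -> separator query dim


@torch.no_grad()
def text_query_embedding(prompt: str) -> torch.Tensor:
    """Map a free-form text prompt to a separator query embedding."""
    tokens = clip.tokenize([prompt]).to(device)          # (1, 77) token ids
    feat = clip_model.encode_text(tokens).float()        # (1, 512) text features
    feat = feat / feat.norm(dim=-1, keepdim=True)        # unit-normalize
    return adapter(feat)                                 # (1, 128) query embedding


# Hypothetical usage with the earlier sketch:
#   query = text_query_embedding("a dog barking")
#   target_estimate = separator(mixture_spectrogram, query)
```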