Separate Anything You Describe
- URL: http://arxiv.org/abs/2308.05037v2
- Date: Fri, 27 Oct 2023 15:34:06 GMT
- Title: Separate Anything You Describe
- Authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui
Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
- Abstract summary: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
- Score: 55.0784713558149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-queried audio source separation (LASS) is a new paradigm for
computational auditory scene analysis (CASA). LASS aims to separate a target
sound from an audio mixture given a natural language query, which provides a
natural and scalable interface for digital audio applications. Recent works on
LASS, despite attaining promising separation performance on specific sources
(e.g., musical instruments, limited classes of audio events), are unable to
separate audio concepts in the open domain. In this work, we introduce
AudioSep, a foundation model for open-domain audio source separation with
natural language queries. We train AudioSep on large-scale multimodal datasets
and extensively evaluate its capabilities on numerous tasks including audio
event separation, musical instrument separation, and speech enhancement.
AudioSep demonstrates strong separation performance and impressive zero-shot
generalization ability using audio captions or text labels as queries,
substantially outperforming previous audio-queried and language-queried sound
separation models. For reproducibility of this work, we will release the source
code, evaluation benchmark and pre-trained model at:
https://github.com/Audio-AGI/AudioSep.
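The repository above exposes a simple text-queried inference path. The following is a minimal usage sketch, not a verified recipe: the `pipeline` module, `build_audiosep`, `separate_audio`, and the config/checkpoint paths are assumptions based on the repository's README and may differ from the current code.
```python
# Hedged usage sketch: build_audiosep / separate_audio and the file paths
# below are assumptions based on the AudioSep README; verify against
# https://github.com/Audio-AGI/AudioSep before use.
import torch
from pipeline import build_audiosep, separate_audio

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = build_audiosep(
    config_yaml='config/audiosep_base.yaml',          # assumed config path
    checkpoint_path='checkpoint/audiosep_base.ckpt',  # assumed checkpoint path
    device=device)

# Separate the source described by the natural language query from the mixture.
separate_audio(model, 'mixture.wav', 'water drops', 'separated.wav', device)
```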
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio LLM (MALLM) to capture audio context among multiple similar audio clips.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Semantic Grouping Network for Audio Source Separation [41.54814517077309]
We present a novel Semantic Grouping Network, termed SGN, that can directly disentangle sound representations and extract high-level semantic information for each source from the input audio mixture.
We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound.
arXiv Detail & Related papers (2024-07-04T08:37:47Z)
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called the "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually described semantic, spatial, and temporal conditions.
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
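This zero-shot query swap works because CLIP embeds images and text in a shared space, so a separator conditioned on image embeddings during training can be driven by text embeddings at test time. Below is a minimal sketch of computing such a text query vector with the public Hugging Face CLIP checkpoint; the separator network that consumes this vector is not shown, and the model name is the standard public release, not necessarily the one used in the paper.
```python
# Sketch of the CLIP query-encoding step only; the downstream separator
# network is omitted.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# At training time the query vector would come from model.get_image_features
# on a video frame; at test time a text description lands in the same space.
inputs = processor(text=["a dog barking"], return_tensors="pt", padding=True)
with torch.no_grad():
    query_vector = model.get_text_features(**inputs)  # shape: (1, 512)
```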
arXiv Detail & Related papers (2022-12-14T07:21:45Z)
- Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS).
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information.
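As an illustration of what joint acoustic-linguistic processing can look like, here is a conceptual sketch (not LASS-Net's actual architecture) in which a text query embedding modulates a spectrogram-masking network via FiLM-style conditioning; all layer sizes are arbitrary.
```python
# Conceptual sketch only: a text embedding conditions a time-frequency
# mask over the mixture spectrogram. This is NOT the LASS-Net architecture.
import torch
import torch.nn as nn

class TextConditionedMasker(nn.Module):
    def __init__(self, n_freq=513, text_dim=512, hidden=256):
        super().__init__()
        self.enc = nn.Linear(n_freq, hidden)         # per-frame encoder
        self.film = nn.Linear(text_dim, 2 * hidden)  # scale/shift from the query
        self.dec = nn.Linear(hidden, n_freq)         # per-frame mask decoder

    def forward(self, mix_mag, text_emb):
        # mix_mag:  (batch, time, n_freq) mixture magnitude spectrogram
        # text_emb: (batch, text_dim) query embedding from any text encoder
        h = torch.relu(self.enc(mix_mag))
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)  # FiLM conditioning
        mask = torch.sigmoid(self.dec(h))               # time-frequency mask
        return mask * mix_mag                           # estimated target magnitude

masker = TextConditionedMasker()
est = masker(torch.rand(1, 120, 513), torch.rand(1, 512))  # (1, 120, 513)
```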
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.