ZeroSep: Separate Anything in Audio with Zero Training
- URL: http://arxiv.org/abs/2505.23625v1
- Date: Thu, 29 May 2025 16:31:45 GMT
- Title: ZeroSep: Separate Anything in Audio with Zero Training
- Authors: Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
- Abstract summary: Audio source separation is fundamental for machines to understand complex acoustic environments. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data. We investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model.
- Score: 42.19808124670159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
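To make the recipe above concrete, here is a minimal sketch of the invert-then-denoise idea, under stated assumptions: `encode`, `decode`, `embed`, and `eps_model` are placeholders for a generic text-guided audio diffusion backbone (they are not ZeroSep's actual API), the DDIM schedules are simplified, and the use of an unconditional (empty) prompt during inversion is an assumption rather than the paper's stated configuration.

```python
# Hedged sketch of ZeroSep-style zero-shot separation: DDIM-invert the mixture
# into the diffusion latent space, then denoise once per source-specific text prompt.
# `encode`, `decode`, `embed`, and `eps_model` are placeholders for a pre-trained
# text-guided audio diffusion backbone; they are assumptions, not the paper's API.
import torch

@torch.no_grad()
def ddim_invert(z0, eps_model, cond, alphas_cumprod, steps=50):
    """Deterministically re-noise a clean latent z0 toward z_T (DDIM inversion)."""
    ts = torch.linspace(0, len(alphas_cumprod) - 1, steps).long()
    z = z0
    for t_prev, t in zip(ts[:-1], ts[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(z, t_prev, cond)                       # noise prediction
        z0_hat = (z - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        z = a_t.sqrt() * z0_hat + (1 - a_t).sqrt() * eps       # step "backwards" in time
    return z

@torch.no_grad()
def ddim_denoise(zT, eps_model, cond, alphas_cumprod, steps=50):
    """Standard DDIM sampling from z_T to z_0, guided by a source-specific prompt."""
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    z = zT
    for t, t_next in zip(ts[:-1], ts[1:]):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(z, t, cond)
        z0_hat = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * z0_hat + (1 - a_next).sqrt() * eps
    return z

@torch.no_grad()
def separate(mixture, prompts, encode, decode, embed, eps_model, alphas_cumprod):
    """Invert the mixture once, then recover one estimated source per text prompt."""
    z_mix = encode(mixture)                                    # audio -> latent
    z_T = ddim_invert(z_mix, eps_model, embed(""), alphas_cumprod)  # unconditional inversion (assumed)
    return [decode(ddim_denoise(z_T, eps_model, embed(p), alphas_cumprod)) for p in prompts]
```

In practice the backbone would be a latent audio diffusion model with its own VAE and text encoder; the sketch deliberately leaves those components abstract.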
Related papers
- DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization [6.6567375919025995]
Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. We introduce a training-free framework leveraging generative priors for zero-shot LASS. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. (A minimal sketch of this mask-optimization idea appears after this list.)
arXiv Detail & Related papers (2025-06-03T13:24:57Z)
- Text-Queried Audio Source Separation via Hierarchical Modeling [53.94434504259829]
We propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. A Q-Audio architecture is employed to align audio and text modalities, serving as pretrained global-semantic encoders. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.
arXiv Detail & Related papers (2025-05-27T11:00:38Z)
- SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline [29.85417427778784]
SoloSpeech is a cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. It achieves new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks.
arXiv Detail & Related papers (2025-05-25T21:00:48Z)
- Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
- Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes [16.530816405275715]
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing.
arXiv Detail & Related papers (2025-03-24T16:56:04Z)
- Separate This, and All of these Things Around It: Music Source Separation via Hyperellipsoidal Queries [53.30852012059025]
Music source separation is an audio-to-audio retrieval task. Recent work in music source separation has begun to challenge the fixed-stem paradigm. We propose the use of hyperellipsoidal regions as queries, allowing an intuitive yet easily parametrizable way to specify both the target (location) and its spread. (A sketch of such an ellipsoidal query appears after this list.)
arXiv Detail & Related papers (2025-01-27T16:13:50Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content creation in the music and film industry. Our hypothesis is that focusing on these aspects of audio generation could improve performance in the presence of limited data. We synthetically create a preference dataset in which each prompt has a winner audio output and several loser audio outputs for the diffusion model to learn from. (A sketch of the underlying preference objective appears after this list.)
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data [26.058278155958668]
We propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet.
Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training.
The proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training.
arXiv Detail & Related papers (2021-12-15T05:13:43Z)
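The DGMO entry above describes training-free separation via diffusion-guided mask optimization. The following is a hedged sketch of one plausible reading of that idea: assume a text-guided diffusion model has already produced a reference magnitude spectrogram `ref_mag` for the queried source (that generation step is abstracted away), then optimize a soft time-frequency mask over the mixture so the masked mixture matches the reference. The function names, the L1 loss, and magnitude-domain masking are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch of diffusion-guided mask optimization (DGMO-style), assuming a
# reference magnitude spectrogram `ref_mag` has already been produced by a
# text-guided audio diffusion model for the queried source.
import torch

def optimize_mask(mix_mag, ref_mag, steps=500, lr=1e-1):
    """Fit a soft time-frequency mask so that mask * mixture ~= reference."""
    logits = torch.zeros_like(mix_mag, requires_grad=True)   # one logit per TF bin
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        mask = torch.sigmoid(logits)                          # keep mask values in [0, 1]
        loss = (mask * mix_mag - ref_mag).abs().mean()        # L1 in magnitude domain (assumed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()

def apply_mask(mix_stft, mask):
    """Apply the optimized magnitude mask to the complex mixture STFT."""
    return mask * mix_stft                                    # reuse the mixture phase
```

The separated waveform would then come from an inverse STFT of the masked mixture; the real method's reference generation, loss, and masking domain may differ.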
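For the hyperellipsoidal-query entry, the core object is an ellipsoidal region in an embedding space: a centre encodes the target's location and per-axis radii encode its spread. A minimal sketch of such a query with a soft membership score follows; the actual parameterization and how a separation network consumes the query are assumptions here.

```python
# Hedged sketch of a hyperellipsoidal query over an audio embedding space:
# a centre vector gives the target "location", per-dimension radii give its spread.
import torch

def ellipsoid_membership(emb, center, radii, sharpness=5.0):
    """Soft membership of embeddings `emb` (N, D) in the ellipsoid (center, radii)."""
    d = (((emb - center) / radii) ** 2).sum(dim=-1).sqrt()    # scaled, Mahalanobis-like distance
    return torch.sigmoid(sharpness * (1.0 - d))               # ~1 inside the ellipsoid, ~0 outside

# Example: weight per-frame embeddings by how well they fall inside the query region.
frames = torch.randn(100, 64)          # hypothetical per-frame embeddings
query_center = torch.zeros(64)         # e.g. derived from a text or audio query
query_radii = torch.full((64,), 0.5)   # controls how broad the query is
weights = ellipsoid_membership(frames, query_center, query_radii)
```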
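The Tango 2 entry builds on direct preference optimization over synthetic winner/loser audio pairs. Below is the standard DPO objective in a hedged, generic form; for a diffusion model the per-example log-likelihoods would have to be approximated (for example via denoising losses), which this sketch leaves abstract.

```python
# Hedged sketch of the DPO objective used to align a generator with preferences:
# inputs are (approximate) log-likelihoods of winner/loser outputs under the model
# being tuned and under a frozen reference model. How those log-likelihoods are
# estimated for a diffusion model is abstracted away here.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for a batch of (winner, loser) pairs."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)   # implicit reward gap
    return -F.logsigmoid(beta * margin).mean()                # push winners above losers
```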