DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization
- URL: http://arxiv.org/abs/2506.02858v2
- Date: Thu, 05 Jun 2025 04:46:57 GMT
- Title: DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization
- Authors: Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo
- Abstract summary: Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. We introduce a training-free framework leveraging generative priors for zero-shot LASS. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision.
- Score: 6.6567375919025995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation. The code is available at: https://wltschmrz.github.io/DGMO/
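To make the mechanism concrete, below is a minimal PyTorch sketch of test-time mask optimization in the spirit of DGMO. The sigmoid mask parameterization, mean-squared objective, optimizer, and iteration budget are illustrative assumptions, not the authors' implementation; the reference magnitude spectrogram is assumed to come from a pretrained text-to-audio diffusion model conditioned on the language query.

```python
import torch

def dgmo_style_separation(mix_mag, reference_mag, num_iters=200, lr=1e-2):
    """Test-time mask optimization (a sketch, not the authors' code).
    A sigmoid-bounded mask on the mixture magnitude spectrogram is
    optimized so the masked mixture matches a diffusion-generated,
    query-conditioned reference, keeping the output input-aligned.
    """
    logits = torch.zeros_like(mix_mag, requires_grad=True)   # mask parameters
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(num_iters):
        mask = torch.sigmoid(logits)                         # mask constrained to (0, 1)
        loss = torch.mean((mask * mix_mag - reference_mag) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach() * mix_mag          # separated magnitude
```

Because the mask is applied to the actual mixture spectrogram, the result stays tied to the input signal; the separated magnitude can be paired with the mixture phase and inverted with an inverse STFT to recover a waveform.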
Related papers
- ZeroSep: Separate Anything in Audio with Zero Training [42.19808124670159]
Audio source separation is fundamental for machines to understand complex acoustic environments. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data. We investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model.
arXiv Detail & Related papers (2025-05-29T16:31:45Z)
- Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We derive the theoretical backbone of a family of general interpolating discrete diffusion processes. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise.
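As a toy illustration of mixing masked and uniform corruption, here is a sketch of a hybrid discrete forward step; the per-position Bernoulli corruption and the fixed mixing weight `p_uniform` are assumptions for exposition, not GIDD's actual interpolation schedule.

```python
import torch

def hybrid_corrupt(tokens, t, mask_id, vocab_size, p_uniform=0.2):
    """Hybrid masking + uniform-noise corruption (a sketch, not GIDD's
    exact process). Each position is corrupted with probability t; a
    corrupted position becomes a uniformly random token with probability
    p_uniform, otherwise the MASK token.
    """
    corrupt = torch.rand(tokens.shape) < t                  # positions to corrupt
    uniform = torch.rand(tokens.shape) < p_uniform          # uniform noise vs. mask
    random_tokens = torch.randint_like(tokens, vocab_size)
    noised = torch.where(uniform, random_tokens, torch.full_like(tokens, mask_id))
    return torch.where(corrupt, noised, tokens)
```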
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
- Training-free Diffusion Model Alignment with Sampling Demons [15.400553977713914]
We propose an optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling the noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through optimization. Our experiments show that the proposed approach significantly improves average aesthetic scores in text-to-image generation.
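A minimal sketch of backpropagation-free, reward-guided sampling of this flavor: at each reverse step, draw several candidate noises, preview each resulting state, and keep the one the reward function prefers. Here `denoiser`, `step_fn`, and `reward_fn` are assumed interfaces, not a specific library's API, and the greedy candidate selection is a simplification of Demon's density control.

```python
import torch

@torch.no_grad()
def reward_guided_step(x_t, t, denoiser, step_fn, reward_fn, num_candidates=8):
    """One reward-guided denoising step (a sketch). No gradients flow
    through reward_fn; guidance comes purely from sampling and selection.
    """
    eps_pred = denoiser(x_t, t)                      # model's noise estimate
    best_x, best_r = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn_like(x_t)                # candidate stochastic noise
        x_next = step_fn(x_t, eps_pred, t, noise)    # candidate next state
        r = reward_fn(x_next)                        # score candidate (e.g., aesthetics)
        if r > best_r:
            best_x, best_r = x_next, r
    return best_x
```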
arXiv Detail & Related papers (2024-10-08T07:33:49Z)
- Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach. It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
- OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation [9.453883041423468]
We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation.
OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present.
It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures.
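The pipeline reads naturally as three stages; here is a high-level sketch in which `captioner`, `llm`, and `separator` are hypothetical callables standing in for the off-the-shelf components the summary names.

```python
def open_world_separate(mixture_audio, captioner, llm, separator):
    """Caption -> parse -> separate pipeline in the spirit of OpenSep
    (a sketch, not the authors' code).
    """
    # 1. Caption the mixture with an off-the-shelf audio captioning model.
    caption = captioner(mixture_audio)          # e.g., "a dog barks while rain falls"
    # 2. Few-shot-prompt an LLM to parse the caption into distinct sources.
    sources = llm(f"List the distinct sound sources in: {caption}")
    # 3. Separate each parsed source from the mixture with a language query.
    return {src: separator(mixture_audio, query=src) for src in sources}
```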
arXiv Detail & Related papers (2024-09-28T06:59:52Z)
- DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE).
We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods.
We devise a two-stage training strategy to emulate the inference process during model training.
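A sketch of the discriminative use of the diffusion forward process described above; tensor shapes, the DDPM-style schedule, and the plain MSE objective are assumptions for illustration.

```python
import torch

def ddtse_style_loss(model, clean, mixture, enrol, t, alpha_bar):
    """Noise the target exactly as in diffusion training, but supervise the
    model with a reconstruction loss on the clean speech (a sketch of the
    DDTSE idea, not the paper's implementation).
    """
    noise = torch.randn_like(clean)
    a = alpha_bar[t].view(-1, *([1] * (clean.dim() - 1)))  # broadcast alpha_bar_t
    x_t = a.sqrt() * clean + (1 - a).sqrt() * noise        # standard forward process
    estimate = model(x_t, mixture, enrol, t)               # condition on mixture + enrolment
    return torch.mean((estimate - clean) ** 2)             # reconstruction, not noise, loss
```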
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
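A generic posterior-sampling step of this kind can be sketched as follows; the Langevin-style update and the interfaces are assumptions (the paper's exact sampler may differ). The clean-speech prior enters through a score model, and the NMF noise model enters through a differentiable log-likelihood that returns a scalar.

```python
import torch

def posterior_sampling_step(x_t, t, score_model, log_likelihood, step_size):
    """One guided reverse step: prior score plus likelihood gradient
    (a generic diffusion-posterior-sampling sketch).
    """
    x_t = x_t.detach().requires_grad_(True)
    prior_score = score_model(x_t, t)               # gradient of log prior (clean speech)
    guidance = torch.autograd.grad(log_likelihood(x_t), x_t)[0]  # data consistency
    noise = torch.randn_like(x_t)
    # Langevin-style update combining the two gradients
    return (x_t + step_size * (prior_score + guidance)
            + (2 * step_size) ** 0.5 * noise).detach()
```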
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
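In code, the generative reformulation amounts to noising ground-truth event boundaries and training the model to recover them; the (batch, events, 2) boundary layout, schedule, and MSE objective below are assumptions for illustration.

```python
import torch

def diffsed_style_train_step(model, audio_feats, gt_boundaries, t, alpha_bar):
    """Sound event boundaries as a denoising target (a sketch of the
    DiffSED idea). gt_boundaries holds (onset, offset) pairs in [0, 1].
    """
    noise = torch.randn_like(gt_boundaries)
    a = alpha_bar[t].view(-1, 1, 1)                 # broadcast over (batch, events, 2)
    noisy_proposals = a.sqrt() * gt_boundaries + (1 - a).sqrt() * noise
    pred = model(noisy_proposals, audio_feats, t)   # refine proposals toward ground truth
    return torch.mean((pred - gt_boundaries) ** 2)
```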
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
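For intuition, a baseline-style query-based membership signal can be sketched as below; this thresholds the denoising error at a fixed timestep and is a generic stand-in, not the paper's proximal-initialization attack.

```python
import torch

@torch.no_grad()
def membership_score(model, x, t, alpha_bar):
    """Lower denoising error at a fixed timestep suggests the sample was
    seen in training (a generic loss-threshold MIA sketch).
    """
    noise = torch.randn_like(x)
    a = alpha_bar[t]
    x_t = a.sqrt() * x + (1 - a).sqrt() * noise
    eps_pred = model(x_t, t)                        # a single query to the model
    return torch.mean((eps_pred - noise) ** 2)      # threshold this to decide membership
```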
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- Dior-CVAE: Pre-trained Language Models and Diffusion Priors for Variational Dialog Generation [70.2283756542824]
Dior-CVAE is a hierarchical conditional variational autoencoder (CVAE) with diffusion priors.
We employ a diffusion model to increase the complexity of the prior distribution and its compatibility with the distributions produced by a PLM.
Experiments across two commonly used open-domain dialog datasets show that our method can generate more diverse responses without large-scale dialog pre-training.
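At generation time, the diffusion prior simply replaces the usual Gaussian draw over the latent; a sketch, where `diffusion_prior` and `plm_decoder` are hypothetical components:

```python
def dior_cvae_style_generate(context, diffusion_prior, plm_decoder, steps=50):
    """Sample a dialog latent from a context-conditioned diffusion prior,
    then decode with a pre-trained language model (a sketch).
    """
    z = diffusion_prior.sample(condition=context, num_steps=steps)  # expressive prior draw
    return plm_decoder(context, latent=z)                           # PLM decodes the response
```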
arXiv Detail & Related papers (2023-05-24T11:06:52Z)
- A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features of text.
Specifically, we design a linguistically informed forward process that adds corruptions to the text through strategic soft-masking, noising the textual data more effectively.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
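The soft-masking idea can be illustrated on continuous token embeddings; the tf-idf-style importance weights and the simple time-scaled interpolation below are assumptions, not the paper's exact corruption schedule.

```python
import torch

def soft_mask_corrupt(embeddings, importance, t, mask_embedding):
    """Soft (continuous) masking: interpolate each token embedding toward a
    MASK embedding, with corruption strength scaled by the token's
    linguistic importance and the diffusion time t in [0, 1] (a sketch).
    """
    w = torch.clamp(t * importance, 0.0, 1.0).unsqueeze(-1)  # per-token corruption weight
    return (1 - w) * embeddings + w * mask_embedding         # soft-masked embeddings
```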
arXiv Detail & Related papers (2023-04-10T17:58:42Z)
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z)