Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models
- URL: http://arxiv.org/abs/2408.07472v2
- Date: Tue, 25 Mar 2025 11:24:55 GMT
- Title: Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models
- Authors: Jean-Marie Lemercier, Eloi Moliner, Simon Welker, Vesa Välimäki, Timo Gerkmann
- Abstract summary: This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. We study BUDDy's performance for RIR estimation and observe that it surpasses a state-of-the-art supervised DNN-based estimator on mismatched acoustic conditions.
- Score: 21.669363620480333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. The algorithm is rooted in Bayesian posterior sampling: it combines a likelihood model enforcing fidelity to the reverberant measurement, and an anechoic speech prior implemented by an unconditional diffusion model. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. Room acoustics estimation and speech dereverberation are jointly carried out, as the filter parameters are iteratively estimated and the speech utterance refined along the reverse diffusion trajectory. In a blind scenario where the RIR is unknown, BUDDy successfully performs speech dereverberation in various acoustic scenarios, significantly outperforming other blind unsupervised baselines. Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions. This paper extends our previous work by offering new experimental results and insights into the algorithm's versatility. We demonstrate the robustness of our proposed method to new acoustic and speaker conditions, as well as its adaptability to high-resolution singing voice dereverberation, using both instrumental metrics and subjective listening evaluation. We study BUDDy's performance for RIR estimation and observe it surpasses a state-of-the-art supervised DNN-based estimator on mismatched acoustic conditions. Finally, we investigate the sensitivity of informed dereverberation methods to RIR estimation errors, thereby motivating the joint acoustic estimation and dereverberation design. Audio examples and code can be found online.
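The abstract describes the RIR as a parametric filter with an exponential decay per frequency subband. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation: it splits white noise into bands via a simple FFT mask and applies a per-band exponential envelope. The function name, the two-band configuration, and the amplitude/decay parameters are all hypothetical choices for illustration.

```python
import numpy as np

def subband_rir(amps, decays, n_taps, sr=16000):
    """Toy RIR model: sum of band-filtered noise, each band scaled by
    an exponentially decaying envelope A_b * exp(-t / tau_b)."""
    t = np.arange(n_taps) / sr
    rng = np.random.default_rng(0)
    n_bands = len(amps)
    # One noise realization per subband.
    noise = rng.standard_normal((n_bands, n_taps))
    # Crude subband split: keep a disjoint slice of the spectrum per band.
    spec = np.fft.rfft(noise, axis=-1)
    edges = np.linspace(0, spec.shape[-1], n_bands + 1).astype(int)
    for b in range(n_bands):
        mask = np.zeros(spec.shape[-1])
        mask[edges[b]:edges[b + 1]] = 1.0
        spec[b] *= mask
    band_noise = np.fft.irfft(spec, n=n_taps, axis=-1)
    # Per-band exponential decay, then sum across bands.
    envelopes = amps[:, None] * np.exp(-t[None, :] / decays[:, None])
    return (envelopes * band_noise).sum(axis=0)

# Two illustrative bands with different decay times (in seconds).
rir = subband_rir(np.array([1.0, 0.5]), np.array([0.1, 0.05]), 4096)
```

In BUDDy, parameters of this kind are not fixed up front but iteratively re-estimated along the reverse diffusion trajectory, jointly with the refined speech estimate.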
Related papers
- VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification [9.726628816336651]
This work proposes a variational Bayesian inference framework with neural speech prior (VINP) for joint speech dereverberation and blind RIR identification.
Experiments on single-channel speech dereverberation demonstrate that VINP reaches an advanced level in most metrics related to human perception.
arXiv Detail & Related papers (2025-02-11T02:54:28Z) - A Hybrid Model for Weakly-Supervised Speech Dereverberation [2.731944614640173]
This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech.
Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state-of-the-art.
arXiv Detail & Related papers (2025-02-06T09:21:22Z) - Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
Randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations.
Diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples.
We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z) - DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval [49.076590578101985]
We present a diffusion-based ATR framework (DiffATR) that generates joint distribution from noise.
Experiments on the AudioCaps and Clotho datasets show superior performance and verify the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-16T06:33:26Z) - BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models [21.66936362048033]
We present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation.
We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined.
arXiv Detail & Related papers (2024-05-07T12:41:31Z) - Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE)
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Diffusion Posterior Sampling for Informed Single-Channel Dereverberation [15.16865739526702]
We present an informed single-channel dereverberation method based on conditional generation with diffusion models.
With knowledge of the room impulse response, the anechoic utterance is generated via reverse diffusion.
The proposed approach is largely more robust to measurement noise compared to a state-of-the-art informed single-channel dereverberation method.
arXiv Detail & Related papers (2023-06-21T14:14:05Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Few-Shot Audio-Visual Learning of Environment Acoustics [89.16560042178523]
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener.
We explore how to infer RIRs based on a sparse set of images and echoes observed in the space.
In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs.
arXiv Detail & Related papers (2022-06-08T16:38:24Z) - Mean absorption estimation from room impulse responses using virtually supervised learning [0.0]
This paper introduces and investigates a new approach to estimate mean absorption coefficients solely from a room impulse response (RIR).
This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption mapping is implicitly learned by regression on a simulated dataset using artificial neural networks.
arXiv Detail & Related papers (2021-09-01T14:06:20Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method.
A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.
We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z) - Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.