Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models
- URL: http://arxiv.org/abs/2408.07472v1
- Date: Wed, 14 Aug 2024 11:31:32 GMT
- Title: Unsupervised Blind Joint Dereverberation and Room Acoustics Estimation with Diffusion Models
- Authors: Jean-Marie Lemercier, Eloi Moliner, Simon Welker, Vesa Välimäki, Timo Gerkmann
- Abstract summary: We present an unsupervised method for blind dereverberation and room impulse response estimation, called BUDDy.
In a blind scenario where the room impulse response is unknown, BUDDy successfully performs speech dereverberation.
Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions.
- Score: 21.669363620480333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an unsupervised method for single-channel blind dereverberation and room impulse response (RIR) estimation, called BUDDy. The algorithm is rooted in Bayesian posterior sampling: it combines a likelihood model enforcing fidelity to the reverberant measurement, and an anechoic speech prior implemented by an unconditional diffusion model. We design a parametric filter representing the RIR, with exponential decay for each frequency subband. Room acoustics estimation and speech dereverberation are jointly carried out, as the filter parameters are iteratively estimated and the speech utterance refined along the reverse diffusion trajectory. In a blind scenario where the room impulse response is unknown, BUDDy successfully performs speech dereverberation in various acoustic scenarios, significantly outperforming other blind unsupervised baselines. Unlike supervised methods, which often struggle to generalize, BUDDy seamlessly adapts to different acoustic conditions. This paper extends our previous work by offering new experimental results and insights into the algorithm's performance and versatility. We first investigate the robustness of informed dereverberation methods to RIR estimation errors, to motivate the joint acoustic estimation and dereverberation paradigm. Then, we demonstrate the adaptability of our method to high-resolution singing voice dereverberation, study its performance in RIR estimation, and conduct subjective evaluation experiments to validate the perceptual quality of the results, among other contributions. Audio samples and code can be found online.
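The abstract's two core ideas lend themselves to a compact illustration. The sketch below (Python, written for this summary and not taken from the BUDDy codebase) shows a subband RIR model with one exponentially decaying envelope per frequency band, and a loop that alternates a diffusion-prior denoising step with a crude consistency update of the filter parameters against the reverberant measurement. The function names, the `prior_denoiser` callable, the noise schedule, and the parameter updates are illustrative placeholders, not the authors' implementation.

```python
# Toy sketch (not the BUDDy code) of: (1) a subband RIR parameterization with
# per-band exponential decay, and (2) alternating filter-parameter updates and
# speech refinement along a reverse diffusion trajectory.
import numpy as np

def subband_exponential_rir(amplitudes, decay_rates, n_taps, fs=16000):
    """Build a subband-domain RIR: band b decays as a_b * exp(-d_b * t)."""
    t = np.arange(n_taps) / fs                      # time axis in seconds
    return amplitudes[:, None] * np.exp(-decay_rates[:, None] * t[None, :])

def apply_reverberation(subband_speech, subband_rir):
    """Convolve each speech subband with the matching subband filter."""
    return np.stack([np.convolve(s, h)[: s.shape[-1]]
                     for s, h in zip(subband_speech, subband_rir)])

def joint_dereverb_sketch(y_sub, prior_denoiser, n_steps=30, lr=1e-2):
    """Toy posterior-sampling loop: refine the speech estimate with a diffusion
    prior while nudging the RIR parameters toward the reverberant measurement."""
    n_bands, n_frames = y_sub.shape
    x = np.random.randn(n_bands, n_frames)          # start from noise
    amps = np.ones(n_bands)                         # illustrative filter parameters
    decays = np.full(n_bands, 50.0)                 # kept fixed here for brevity
    for step in range(n_steps, 0, -1):
        sigma = step / n_steps                      # toy noise schedule
        x = prior_denoiser(x, sigma)                # prior step: anechoic speech model
        h = subband_exponential_rir(amps, decays, n_taps=256)
        residual = y_sub - apply_reverberation(x, h)
        amps += lr * np.mean(residual, axis=-1)     # crude stand-in for a parameter update
        x += lr * residual                          # crude pull toward measurement consistency
    return x, (amps, decays)
```

In the paper, the filter parameters are estimated with proper gradient-based updates of the subband decay model in the time-frequency domain; the additive corrections above only stand in for that mechanism.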
Related papers
- DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval [49.076590578101985]
We present a diffusion-based ATR framework (DiffATR) that generates the joint audio-text distribution from noise.
Experiments on the AudioCaps and Clotho datasets show superior performance and verify the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-16T06:33:26Z) - BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models [21.66936362048033]
We present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation.
We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined.
arXiv Detail & Related papers (2024-05-07T12:41:31Z) - Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
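As a rough illustration of that weighted objective, the snippet below (an assumption-laden sketch, not the paper's exact loss) combines a denoising score-matching term with a supervised MSE term between a one-step enhanced estimate and the clean target; `model`, the noise schedule, and the weight `alpha` are placeholders.

```python
# Sketch: diffusion training objective augmented with a supervised MSE term.
import torch
import torch.nn.functional as F

def weighted_diffusion_loss(model, clean, noisy, sigma, alpha=0.5):
    """Weighted sum of a score-matching loss and a supervised MSE loss."""
    eps = torch.randn_like(clean)                      # Gaussian noise for the diffusion target
    perturbed = clean + sigma * eps                    # noised clean speech (toy schedule)
    score = model(perturbed, noisy, sigma)             # conditional score/denoiser network
    loss_diffusion = F.mse_loss(score, -eps / sigma)   # standard denoising score-matching target
    enhanced = perturbed + (sigma ** 2) * score        # one-step estimate implied by the score
    loss_supervised = F.mse_loss(enhanced, clean)      # discrepancy to ground-truth clean speech
    return alpha * loss_diffusion + (1.0 - alpha) * loss_supervised
```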
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
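A minimal training-step sketch of that idea follows; the model signature, the boundary encoding as normalized [onset, offset] pairs, and the linear noise schedule are assumptions made for illustration, not details of DiffSED.

```python
# Sketch: corrupt ground-truth event boundaries at a random diffusion step and
# train the model to recover them from the noisy proposals, conditioned on audio.
import torch
import torch.nn.functional as F

def diffusion_boundary_training_step(model, audio_features, gt_boundaries, n_steps=1000):
    """gt_boundaries: (batch, n_events, 2) normalized [onset, offset] pairs."""
    t = torch.randint(0, n_steps, (gt_boundaries.shape[0],))     # random diffusion step
    noise_level = (t.float() / n_steps).view(-1, 1, 1)           # toy linear schedule
    noise = torch.randn_like(gt_boundaries)
    noisy_proposals = gt_boundaries + noise_level * noise        # corrupted boundaries
    pred_boundaries = model(audio_features, noisy_proposals, t)  # denoise given the audio
    return F.mse_loss(pred_boundaries, gt_boundaries)            # regress to ground truth
```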
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - Diffusion Posterior Sampling for Informed Single-Channel Dereverberation [15.16865739526702]
We present an informed single-channel dereverberation method based on conditional generation with diffusion models.
With knowledge of the room impulse response, the anechoic utterance is generated via reverse diffusion.
The proposed approach is largely more robust to measurement noise compared to a state-of-the-art informed single-channel dereverberation method.
arXiv Detail & Related papers (2023-06-21T14:14:05Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
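The snippet below sketches what such a short sampling budget can look like in practice: a simplified Euler-Maruyama-style reverse sampler run for 30 steps. The variance-exploding-style update, the conditioning on the noisy mixture, and the sigma schedule are assumptions for illustration, not the paper's exact sampler.

```python
# Sketch: reverse diffusion sampling with a small, fixed number of steps (e.g., 30).
import torch

def reverse_sde_sampler(score_model, noisy_speech, n_steps=30, sigma_max=0.5, sigma_min=0.01):
    x = noisy_speech + sigma_max * torch.randn_like(noisy_speech)   # start near the mixture
    sigmas = torch.linspace(sigma_max, sigma_min, n_steps)
    for i in range(n_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        score = score_model(x, noisy_speech, sigma)                 # conditional score estimate
        step_var = torch.clamp(sigma ** 2 - sigma_next ** 2, min=0.0)
        x = x + step_var * score                                    # simplified VE-style drift step
        x = x + torch.sqrt(step_var) * torch.randn_like(x)          # stochastic correction
    return x
```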
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Mean absorption estimation from room impulse responses using virtually supervised learning [0.0]
This paper introduces and investigates a new approach to estimating mean absorption coefficients solely from a room impulse response (RIR).
This inverse problem is tackled via virtually-supervised learning, namely, the RIR-to-absorption mapping is implicitly learned by regression on a simulated dataset using artificial neural networks.
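The sketch below illustrates the virtually supervised regression setup: a small network trained on simulated pairs of RIR-derived features and per-band absorption coefficients. The feature dimensionality, network architecture, and random stand-in data are placeholders, not the paper's configuration.

```python
# Sketch: regress per-band mean absorption coefficients from RIR-derived features,
# training only on simulated data ("virtually supervised").
import torch
import torch.nn as nn

class AbsorptionRegressor(nn.Module):
    def __init__(self, n_features=128, n_bands=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_bands), nn.Sigmoid(),   # absorption coefficients lie in [0, 1]
        )

    def forward(self, rir_features):
        return self.net(rir_features)

model = AbsorptionRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
features = torch.randn(32, 128)   # stand-in for features extracted from simulated RIRs
targets = torch.rand(32, 6)       # stand-in for per-band mean absorption coefficients
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(features), targets)
loss.backward()
optimizer.step()
```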
arXiv Detail & Related papers (2021-09-01T14:06:20Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method.
A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.
We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z) - Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)