StoRM: A Diffusion-based Stochastic Regeneration Model for Speech
Enhancement and Dereverberation
- URL: http://arxiv.org/abs/2212.11851v2
- Date: Tue, 12 Mar 2024 15:31:01 GMT
- Title: StoRM: A Diffusion-based Stochastic Regeneration Model for Speech
Enhancement and Dereverberation
- Authors: Jean-Marie Lemercier and Julius Richter and Simon Welker and Timo
Gerkmann
- Abstract summary: We present a regeneration approach where an estimate given by a predictive model is provided as a guide for further diffusion.
We show that the proposed approach uses the predictive model to remove vocalizing and breathing artifacts while producing very high-quality samples.
- Score: 20.262426487434393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have shown a great ability to bridge the performance gap
between predictive and generative approaches for speech enhancement. We have
shown that they may even outperform their predictive counterparts for
non-additive corruption types or when evaluated under mismatched conditions.
However, diffusion models suffer from a high computational burden, mainly
because they require a neural network pass for each reverse diffusion step,
whereas predictive approaches need only a single pass. Being generative
approaches, diffusion models may also produce vocalizing and breathing
artifacts in adverse conditions. In such difficult scenarios, predictive
models typically do not produce such artifacts but instead tend to distort the
target speech, thereby degrading speech quality. In this work, we present a
stochastic regeneration approach in which an estimate given by a predictive
model is provided as a guide for further diffusion. We show that the proposed
approach uses the predictive model to remove vocalizing and breathing
artifacts while producing very high-quality samples thanks to the diffusion
model, even in adverse conditions. We further show that this approach enables
the use of lighter sampling schemes with fewer diffusion steps without
sacrificing quality, thus reducing the computational burden by an order of
magnitude. Source code and audio examples are available online
(https://uhh.de/inf-sp-storm).
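The core idea of stochastic regeneration can be summarized in a few lines. Below is a minimal, hypothetical sketch in Python/NumPy, not the actual StoRM implementation (see the source URL above for that): `predictor` and `score_model` are stand-ins for the two trained networks, and the simplified variance-exploding reverse update is an assumption made for illustration. The key point is that sampling starts from the predictive estimate plus moderate noise, rather than from pure noise, which is what permits far fewer reverse steps.

```python
import numpy as np

# Hypothetical stand-ins: in StoRM both stages are trained neural networks.
def predictor(y):
    """Predictive first stage: returns an initial clean-speech estimate."""
    return y  # placeholder: identity instead of a trained enhancer

def score_model(x, y, y_hat, sigma):
    """Score network conditioned on the noisy input y and the predictive
    estimate y_hat; approximated here by a simple pull toward y_hat."""
    return (y_hat - x) / (sigma ** 2)

def storm_sample(y, n_steps=8, sigma_max=0.5, sigma_min=0.05, seed=0):
    """Stochastic regeneration: noise the predictive estimate up to an
    intermediate level, then run a short reverse diffusion from there."""
    rng = np.random.default_rng(seed)
    y_hat = predictor(y)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    # Initialize from the guide, not from pure noise.
    x = y_hat + sigmas[0] * rng.standard_normal(y.shape)
    for i in range(n_steps - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        var = s ** 2 - s_next ** 2
        # Ancestral-sampling-style reverse step (simplified VE parameterization).
        x = x + var * score_model(x, y, y_hat, s)
        x = x + np.sqrt(var) * rng.standard_normal(y.shape)
    return x

# Usage on a toy noisy signal.
y = np.sin(np.linspace(0.0, 20.0, 16000))
y = y + 0.1 * np.random.default_rng(1).standard_normal(y.shape)
x_enhanced = storm_sample(y)
```

Because the predictive estimate has already removed most of the corruption, the remaining diffusion only needs to regenerate fine detail, which is why the abstract reports an order-of-magnitude lighter sampling schedule.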
Related papers
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
The Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3× sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
- Diffusion Model with Perceptual Loss [4.67483805599143]
Diffusion models trained with mean squared error loss tend to generate unrealistic samples.
We show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance.
We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples.
arXiv Detail & Related papers (2023-12-30T01:24:25Z)
- Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models [76.46246743508651]
We show that current diffusion models actually have an expressive bottleneck in backward denoising.
We introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising.
arXiv Detail & Related papers (2023-09-25T12:03:32Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean-speech prior with a noise model for speech signal inference (see the posterior-sampling sketch after this list).
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model based on two stages: a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)
- How Much is Enough? A Study on Diffusion Times in Score-based Generative Models [76.76860707897413]
Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution.
We show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process.
arXiv Detail & Related papers (2022-06-10T15:09:46Z)
- Truncated Diffusion Probabilistic Models and Diffusion-based Adversarial Auto-Encoders [137.1060633388405]
Diffusion-based generative models learn how to generate the data by inferring a reverse diffusion chain.
We propose a faster and cheaper approach that stops adding noise before the data become pure random noise.
We show that the proposed model can be cast as an adversarial auto-encoder empowered by both the diffusion process and a learnable implicit prior.
arXiv Detail & Related papers (2022-02-19T20:18:49Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models (a generic conditional reverse denoising step is sketched after this list).
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
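Two of the entries above describe mechanisms compact enough to sketch. First, the posterior-sampling methodology from the unsupervised speech enhancement entry: in score form, Bayes' rule says the posterior score is the prior score plus the likelihood score. The sketch below is a minimal illustration under assumptions the cited paper does not actually make this simple: `prior_score` is a hypothetical stand-in for a score model trained on clean speech only, and the noise model is plain Gaussian, whereas the paper's noise model and inference are more involved.

```python
import numpy as np

def prior_score(x, sigma):
    """Hypothetical stand-in for a clean-speech score model; here the exact
    score of a unit Gaussian prior smoothed by noise of scale sigma."""
    return -x / (sigma ** 2 + 1.0)

def posterior_score(x, y, sigma, noise_var=0.1):
    """Bayes' rule in score form: grad log p(x|y) = grad log p(x) + grad log p(y|x).
    Assumed Gaussian noise model: y = x + n, with n ~ N(0, noise_var * I)."""
    likelihood_score = (y - x) / noise_var
    return prior_score(x, sigma) + likelihood_score

# Annealed Langevin dynamics over a decreasing noise scale.
rng = np.random.default_rng(0)
y = rng.standard_normal(256)   # placeholder noisy observation
x = y.copy()                   # warm start from the observation
for sigma in np.geomspace(1.0, 0.01, 30):
    step = 0.05 * sigma ** 2   # small step keeps the stiff likelihood term stable
    for _ in range(3):
        x = x + step * posterior_score(x, y, sigma) \
              + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
```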
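Second, the conditional diffusion probabilistic model entry: a common way to incorporate the observed noisy signal is to condition the denoising network on it at every reverse step. Below is a minimal DDPM-style sketch that assumes conditioning by concatenation and uses a hypothetical `eps_model` stand-in; the cited paper's formulation, which also injects the noisy signal into the diffusion process itself, differs from this generic recipe.

```python
import numpy as np

# Linear beta schedule (a standard DDPM choice).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_and_y, t):
    """Hypothetical stand-in for a trained noise-prediction network that
    sees both the current sample and the noisy observation."""
    return np.zeros(x_and_y.shape[0] // 2)

def ddpm_reverse_step(x_t, t, y, rng):
    """One conditional DDPM ancestral-sampling step."""
    eps = eps_model(np.concatenate([x_t, y]), t)  # condition on y by concatenation
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Usage: start from Gaussian noise and denoise, guided by y at every step.
rng = np.random.default_rng(0)
y = rng.standard_normal(256)   # placeholder noisy speech frame
x = rng.standard_normal(256)
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t, y, rng)
```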
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.