Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions
Using a Heun-Based Sampler
- URL: http://arxiv.org/abs/2312.02683v2
- Date: Tue, 16 Jan 2024 16:17:57 GMT
- Title: Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions
Using a Heun-Based Sampler
- Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen,
Tommy Sonne Alstrøm, Tobias May
- Abstract summary: Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully.
Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models.
We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
- Score: 16.13996677489119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models are a new class of generative models that have recently been
applied to speech enhancement successfully. Previous works have demonstrated
their superior performance in mismatched conditions compared to
state-of-the-art discriminative models. However, this was investigated with a single
database for training and another one for testing, which makes the results
highly dependent on the particular databases. Moreover, recent developments
from the image generation literature remain largely unexplored for speech
enhancement. These include several design aspects of diffusion models, such as
the noise schedule or the reverse sampler. In this work, we systematically
assess the generalization performance of a diffusion-based speech enhancement
model by using multiple speech, noise and binaural room impulse response (BRIR)
databases to simulate mismatched acoustic conditions. We also experiment with a
noise schedule and a sampler that have not been applied to speech enhancement
before. We show that the proposed system substantially benefits from using
multiple databases for training, and achieves superior performance compared to
state-of-the-art discriminative models in both matched and mismatched
conditions. We also show that a Heun-based sampler achieves superior
performance at a smaller computational cost compared to a sampler commonly used
for speech enhancement.
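The abstract contrasts a Heun-based (second-order) reverse sampler with the first-order samplers more commonly used for speech enhancement. As a rough illustration only, not the paper's implementation, a Heun step for the probability-flow ODE can be sketched as follows; `denoise` is a hypothetical denoiser standing in for the trained score model, and the toy usage below is purely synthetic:

```python
import numpy as np

def heun_sampler(denoise, sigmas, x):
    """Sketch of a Heun (second-order) reverse sampler over a decreasing
    noise schedule `sigmas`; `denoise(x, sigma)` is assumed to return an
    estimate of the clean signal at noise level sigma."""
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        d_cur = (x - denoise(x, s_cur)) / s_cur      # ODE derivative dx/dsigma
        x_euler = x + (s_next - s_cur) * d_cur       # first-order Euler predictor
        if s_next > 0:
            # Heun corrector: average the derivative at both endpoints
            d_next = (x_euler - denoise(x_euler, s_next)) / s_next
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler                              # last step: Euler only
    return x

# Toy usage with a stand-in "denoiser" that shrinks toward zero.
sigmas = np.linspace(10.0, 0.0, 20)
x0 = heun_sampler(lambda x, s: x / (1 + s**2), sigmas, np.ones(4) * 5.0)
```

The corrector re-evaluates the derivative at the predicted point, which is what gives Heun's method second-order accuracy per step and typically allows fewer sampling steps than a first-order sampler.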
Related papers
- Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement [14.060387207656046]
Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement.
We propose Ex-Diff, a novel score-based diffusion model that integrates the latent representations produced by a discriminative model to improve speech and vocal enhancement.
arXiv Detail & Related papers (2024-09-15T07:25:08Z)
- Robustness of Speech Separation Models for Similar-pitch Speakers [14.941946672578863]
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.
This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
arXiv Detail & Related papers (2024-07-22T15:55:08Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)