Unsupervised speech enhancement with diffusion-based generative models
- URL: http://arxiv.org/abs/2309.10450v1
- Date: Tue, 19 Sep 2023 09:11:31 GMT
- Title: Unsupervised speech enhancement with diffusion-based generative models
- Authors: Bern\'e Nortier (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Romain
Serizel (MULTISPEECH)
- Abstract summary: We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, conditional score-based diffusion models have gained significant
attention in the field of supervised speech enhancement, yielding
state-of-the-art performance. However, these methods may face challenges when
generalising to unseen conditions. To address this issue, we introduce an
alternative approach that operates in an unsupervised manner, leveraging the
generative power of diffusion models. Specifically, in a training phase, a
clean speech prior distribution is learnt in the short-time Fourier transform
(STFT) domain using score-based diffusion models, allowing it to
unconditionally generate clean speech from Gaussian noise. Then, we develop a
posterior sampling methodology for speech enhancement by combining the learnt
clean speech prior with a noise model for speech signal inference. The noise
parameters are simultaneously learnt along with clean speech estimation through
an iterative expectationmaximisation (EM) approach. To the best of our
knowledge, this is the first work exploring diffusion-based generative models
for unsupervised speech enhancement, demonstrating promising results compared
to a recent variational auto-encoder (VAE)-based unsupervised approach and a
state-of-the-art diffusion-based supervised method. It thus opens a new
direction for future research in unsupervised speech enhancement.
Related papers
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audiovisual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervisedgenerative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z) - GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model [0.0]
We propose GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process.
We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
arXiv Detail & Related papers (2024-02-09T12:12:52Z) - SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - Investigating the Design Space of Diffusion Models for Speech Enhancement [17.914763947871368]
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature.
We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals.
We also show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system.
arXiv Detail & Related papers (2023-12-07T15:40:55Z) - Diffusion-based speech enhancement with a weighted generative-supervised
learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE)
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z) - An Efficient Membership Inference Attack for the Diffusion Model by
Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA)
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z) - DiffusionBERT: Improving Generative Masked Language Models with
Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative
Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.