Related papers: Unsupervised speech enhancement with diffusion-based generative models

Unsupervised speech enhancement with diffusion-based generative models

URL: http://arxiv.org/abs/2309.10450v1
Date: Tue, 19 Sep 2023 09:11:31 GMT
Title: Unsupervised speech enhancement with diffusion-based generative models
Authors: Bern\'e Nortier (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Romain Serizel (MULTISPEECH)
Abstract summary: We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models. We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference. We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, conditional score-based diffusion models have gained significant attention in the field of supervised speech enhancement, yielding state-of-the-art performance. However, these methods may face challenges when generalising to unseen conditions. To address this issue, we introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models. Specifically, in a training phase, a clean speech prior distribution is learnt in the short-time Fourier transform (STFT) domain using score-based diffusion models, allowing it to unconditionally generate clean speech from Gaussian noise. Then, we develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference. The noise parameters are simultaneously learnt along with clean speech estimation through an iterative expectationmaximisation (EM) approach. To the best of our knowledge, this is the first work exploring diffusion-based generative models for unsupervised speech enhancement, demonstrating promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method. It thus opens a new direction for future research in unsupervised speech enhancement.

Related papers

Posterior Transition Modeling for Unsupervised Diffusion-Based Speech Enhancement [26.937216751657697]
We explore unsupervised speech enhancement using diffusion models as expressive generative priors for clean speech.<n>Existing approaches guide the reverse diffusion process using noisy speech through an approximate, noise-perturbed likelihood score.<n>We propose two alternative algorithms that directly model the conditional reverse transition distribution of diffusion states.
arXiv Detail & Related papers (2025-07-03T07:42:02Z)
Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We derive the theoretical backbone of a family of general interpolating discrete diffusion processes. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise.
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
Arbitrary-steps Image Super-resolution via Diffusion Inversion [68.78628844966019]
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result.
arXiv Detail & Related papers (2024-12-12T07:24:13Z)
Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step. Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audiovisual speech enhancement (AVSE) approach. It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervisedgenerative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model [0.0]
We propose GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
arXiv Detail & Related papers (2024-02-09T12:12:52Z)
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation. SpeechGPT-Gen is efficient in semantic and perceptual information modeling. It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
Investigating the Design Space of Diffusion Models for Speech Enhancement [17.914763947871368]
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. We also show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system.
arXiv Detail & Related papers (2023-12-07T15:40:55Z)
Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE) We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z)
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA) Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models. We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z)
Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes. In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals. The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.