Investigating the Design Space of Diffusion Models for Speech
Enhancement
- URL: http://arxiv.org/abs/2312.04370v1
- Date: Thu, 7 Dec 2023 15:40:55 GMT
- Title: Investigating the Design Space of Diffusion Models for Speech
Enhancement
- Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen,
Tommy Sonne Alstrøm, Tobias May
- Abstract summary: Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature.
We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals.
We also show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics.
- Score: 16.13996677489119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models are a new class of generative models that have shown
outstanding performance in image generation literature. As a consequence,
studies have attempted to apply diffusion models to other tasks, such as speech
enhancement. A popular approach in adapting diffusion models to speech
enhancement consists of modelling a progressive transformation between the
clean and noisy speech signals. However, one popular diffusion model framework
previously established in the image generation literature did not account for
such a transformation towards the system input, which prevents relating the
existing diffusion-based speech enhancement systems to that framework. To
address this, we extend this framework to account
for the progressive transformation between the clean and noisy speech signals.
This allows us to apply recent developments from image generation literature,
and to systematically investigate design aspects of diffusion models that
remain largely unexplored for speech enhancement, such as the neural network
preconditioning, the training loss weighting, the stochastic differential
equation (SDE), or the amount of stochasticity injected in the reverse process.
We show that the performance of previous diffusion-based speech enhancement
systems cannot be attributed to the progressive transformation between the
clean and noisy speech signals. Moreover, we show that a proper choice of
preconditioning, training loss weighting, SDE and sampler makes it possible to
outperform a popular diffusion-based speech enhancement system in terms of perceptual
metrics while using fewer sampling steps, thus reducing the computational cost
by a factor of four.
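The progressive transformation described in the abstract is typically realised as a forward SDE whose mean drifts from the clean signal toward the noisy one as diffusion time increases. As a minimal, hypothetical sketch (the parameter `gamma` and the exact parameterisation below are illustrative, not taken from the paper), the mean of such an Ornstein-Uhlenbeck-style forward process can be written as:

```python
import math

def forward_mean(clean, noisy, t, gamma=1.5):
    """Mean of an OU-style interpolating forward process at diffusion time t.

    Illustrative sketch: at t = 0 the mean equals the clean signal, and as t
    grows it decays exponentially toward the noisy signal, i.e.
    mean(t) = e^{-gamma * t} * clean + (1 - e^{-gamma * t}) * noisy.
    """
    a = math.exp(-gamma * t)
    return [a * c + (1.0 - a) * n for c, n in zip(clean, noisy)]

# Toy 1-D "signals" standing in for spectrogram coefficients.
clean = [0.0, 1.0, -1.0]
noisy = [0.5, 0.2, -0.3]

start = forward_mean(clean, noisy, 0.0)   # equals the clean signal
end = forward_mean(clean, noisy, 50.0)    # numerically indistinguishable from the noisy signal
```

In a full system, Gaussian noise with a time-dependent variance is added on top of this mean and a neural network is trained to reverse the process; expressing this interpolating process within a framework from the image generation literature is what lets the authors vary preconditioning, loss weighting, SDE and sampler systematically.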
Related papers
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI)
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion)
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z) - GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model [0.0]
We propose GLA-Grad, which introduces a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process.
We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
arXiv Detail & Related papers (2024-02-09T12:12:52Z) - Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions
Using a Heun-Based Sampler [16.13996677489119]
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully.
Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models.
We show that a proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
arXiv Detail & Related papers (2023-12-05T11:40:38Z) - Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
arXiv Detail & Related papers (2023-09-19T09:11:31Z) - Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image
Captioning [36.4086473737433]
We propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion.
To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model.
In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network.
arXiv Detail & Related papers (2023-09-10T08:55:24Z) - Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net)
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z) - DiffusionBERT: Improving Generative Masked Language Models with
Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z) - Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.