A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- URL: http://arxiv.org/abs/2107.11876v1
- Date: Sun, 25 Jul 2021 19:23:18 GMT
- Title: A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- Authors: Yen-Ju Lu, Yu Tsao and Shinji Watanabe
- Abstract summary: We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus SE task.
- Score: 63.38586161802788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion probabilistic models have demonstrated an outstanding capability to
model natural images and raw audio waveforms through paired diffusion and
reverse processes. The unique property of the reverse process (namely,
eliminating non-target signals from the Gaussian noise and noisy signals) could
be utilized to restore clean signals. Based on this property, we propose a
diffusion probabilistic model-based speech enhancement (DiffuSE) model that
aims to recover clean speech signals from noisy signals. The fundamental
architecture of the proposed DiffuSE model is similar to that of DiffWave--a
high-quality audio waveform generation model that has a relatively low
computational cost and footprint. To attain better enhancement performance, we
designed an advanced reverse process, termed the supportive reverse process,
which adds noisy speech in each time-step to the predicted speech. The
experimental results show that DiffuSE yields performance that is comparable to
related audio generative models on the standardized Voice Bank corpus SE task.
Moreover, relative to the generally suggested full sampling schedule, the
proposed supportive reverse process particularly improves fast sampling,
requiring only a few steps to yield better enhancement results than the
conventional full-step inference process.
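The supportive reverse process described in the abstract can be sketched as a single reverse step that re-injects the noisy observation at every timestep. The blending weight `gamma`, the DDPM-style schedule, and all function names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def supportive_reverse_step(x_t, y, t, model, alphas, alpha_bars, gamma=0.2):
    """One reverse diffusion step that, in the spirit of DiffuSE's
    supportive reverse process, blends the noisy speech y into the
    current estimate. Coefficients are illustrative, not the paper's."""
    eps_hat = model(x_t, t)  # predicted noise component
    a_t, ab_t = alphas[t], alpha_bars[t]
    # Standard DDPM-style posterior mean of the cleaner signal
    mean = (x_t - (1 - a_t) / np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(a_t)
    # Supportive step: mix the noisy observation back in at every timestep
    x_prev = (1 - gamma) * mean + gamma * y
    if t > 0:  # add sampling noise except at the final step
        x_prev += np.sqrt(1 - a_t) * np.random.randn(*x_t.shape)
    return x_prev
```

Under this sketch, fast sampling simply means calling the step for a short subsequence of timesteps; the noisy input `y` keeps each intermediate estimate anchored to the observation.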
Related papers
- Diffusion Conditional Expectation Model for Efficient and Robust Target
Speech Extraction [73.43534824551236]
We propose an efficient generative approach named the Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
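The posterior sampling idea above combines a learned clean-speech prior with a noise model. A minimal sketch, assuming a Gaussian observation model (the paper's actual noise model may differ; `prior_score` and `noise_std` are hypothetical names):

```python
import numpy as np

def posterior_score(x, y, t, prior_score, noise_std=0.1):
    """Posterior score for unsupervised enhancement: the learned
    clean-speech prior score plus the gradient of a Gaussian
    likelihood log p(y | x). Gaussian noise is an assumption here."""
    likelihood_grad = (y - x) / noise_std**2  # grad_x log N(y; x, noise_std^2 I)
    return prior_score(x, t) + likelihood_grad
```

Sampling then follows the usual score-based reverse dynamics, but driven by this posterior score instead of the prior score alone.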
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
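The patch-based processing mentioned above amounts to splitting the input signal into fixed-length segments before the model sees them. A minimal sketch (the patch length and padding scheme are illustrative assumptions, not LinDiff's settings):

```python
import numpy as np

def patchify(wave, patch_len):
    """Split a 1-D waveform into fixed-length patches, zero-padding the
    tail so the length divides evenly. Used to cut per-step compute by
    letting the model operate on short segments in parallel."""
    pad = (-len(wave)) % patch_len  # padding needed to reach a multiple
    padded = np.concatenate([wave, np.zeros(pad)])
    return padded.reshape(-1, patch_len)
```

The inverse operation is a reshape back to 1-D followed by trimming the padding, so patching adds no information loss by itself.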
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative
Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
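Predicting clean data directly, as ProDiff does, changes what the network outputs but the reverse step can still use the standard DDPM posterior. A hedged sketch, with the network's clean-data estimate substituted for the true x0 (schedule details are assumptions):

```python
import numpy as np

def x0_reverse_step(x_t, t, model, alphas, alpha_bars):
    """Reverse step for a denoiser that predicts the clean signal x0
    directly (ProDiff-style parameterization) rather than the added
    noise. Uses the standard DDPM posterior mean as an illustration."""
    x0_hat = model(x_t, t)  # direct clean-data prediction
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_t = 1 - a_t
    # DDPM posterior q(x_{t-1} | x_t, x0) mean, with x0 replaced by x0_hat
    mean = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0_hat \
         + (np.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)) * x_t
    if t > 0:  # sampling noise except at the final step
        mean += np.sqrt(beta_t) * np.random.randn(*x_t.shape)
    return mean
```

Because the network's output is already a clean-data estimate, a few large reverse steps can still land near the data manifold, which is what makes two-iteration synthesis plausible.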
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
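Incorporating the noisy observation into the diffusion process can be sketched as a forward step whose mean interpolates between clean and noisy speech. The interpolation weights `m[t]` and the variance choice below are illustrative assumptions, not the paper's exact schedule:

```python
import numpy as np

def conditional_forward_sample(x0, y, t, alpha_bars, m):
    """Conditional forward diffusion: the mean drifts from clean speech
    x0 toward noisy speech y as t grows, so the reverse process starts
    from (a noised version of) the observation rather than pure noise."""
    ab = alpha_bars[t]
    mean = (1 - m[t]) * np.sqrt(ab) * x0 + m[t] * np.sqrt(ab) * y
    var = 1 - ab  # simple variance schedule, chosen for illustration
    return mean + np.sqrt(var) * np.random.randn(*x0.shape)
```

With `m[t] = 0` this collapses to the standard unconditional forward process, which is one way to see the conditional model as a strict generalization.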
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.