A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- URL: http://arxiv.org/abs/2107.11876v1
- Date: Sun, 25 Jul 2021 19:23:18 GMT
- Title: A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- Authors: Yen-Ju Lu, Yu Tsao and Shinji Watanabe
- Abstract summary: We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
- Score: 63.38586161802788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion probabilistic models have demonstrated an outstanding capability to
model natural images and raw audio waveforms through a paired diffusion and
reverse processes. The unique property of the reverse process (namely,
eliminating non-target signals from the Gaussian noise and noisy signals) could
be utilized to restore clean signals. Based on this property, we propose a
diffusion probabilistic model-based speech enhancement (DiffuSE) model that
aims to recover clean speech signals from noisy signals. The fundamental
architecture of the proposed DiffuSE model is similar to that of DiffWave--a
high-quality audio waveform generation model that has a relatively low
computational cost and footprint. To attain better enhancement performance, we
designed an advanced reverse process, termed the supportive reverse process,
which adds noisy speech in each time-step to the predicted speech. The
experimental results show that DiffuSE yields performance that is comparable to
related audio generative models on the standardized Voice Bank corpus SE task.
Moreover, relative to the generally suggested full sampling schedule, the
proposed supportive reverse process especially improved the fast sampling,
taking few steps to yield better enhancement results over the conventional full
step inference process.
Related papers
- Arbitrary-steps Image Super-resolution via Diffusion Inversion [68.78628844966019]
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance.
We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point.
Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result.
arXiv Detail & Related papers (2024-12-12T07:24:13Z) - Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z) - Improved Noise Schedule for Diffusion Training [51.849746576387375]
We propose a novel approach to design the noise schedule for enhancing the training of diffusion models.
We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule.
arXiv Detail & Related papers (2024-07-03T17:34:55Z) - Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative
Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z) - CRASH: Raw Audio Score-based Generative Modeling for Controllable
High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-base generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.