A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- URL: http://arxiv.org/abs/2107.11876v1
- Date: Sun, 25 Jul 2021 19:23:18 GMT
- Title: A Study on Speech Enhancement Based on Diffusion Probabilistic Model
- Authors: Yen-Ju Lu, Yu Tsao and Shinji Watanabe
- Abstract summary: We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus SE task.
- Score: 63.38586161802788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion probabilistic models have demonstrated an outstanding capability to
model natural images and raw audio waveforms through paired diffusion and
reverse processes. The unique property of the reverse process (namely,
eliminating non-target signals from the Gaussian noise and noisy signals) could
be utilized to restore clean signals. Based on this property, we propose a
diffusion probabilistic model-based speech enhancement (DiffuSE) model that
aims to recover clean speech signals from noisy signals. The fundamental
architecture of the proposed DiffuSE model is similar to that of DiffWave--a
high-quality audio waveform generation model that has a relatively low
computational cost and footprint. To attain better enhancement performance, we
designed an advanced reverse process, termed the supportive reverse process,
which adds noisy speech in each time-step to the predicted speech. The
experimental results show that DiffuSE yields performance that is comparable to
related audio generative models on the standardized Voice Bank corpus SE task.
Moreover, relative to the generally suggested full sampling schedule, the
proposed supportive reverse process particularly improves fast sampling,
requiring only a few steps to yield better enhancement results than the
conventional full-step inference process.
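The supportive reverse process described in the abstract can be sketched as a single reverse step that re-injects the noisy observation at every timestep. The blending weight `gamma`, the DDPM-style schedule, and all function names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def supportive_reverse_step(x_t, y, t, model, alphas, alpha_bars, gamma=0.2):
    """One reverse diffusion step that, in the spirit of DiffuSE's
    supportive reverse process, blends the noisy speech y into the
    current estimate. Coefficients are illustrative, not the paper's."""
    eps_hat = model(x_t, t)  # predicted noise component
    a_t, ab_t = alphas[t], alpha_bars[t]
    # Standard DDPM-style posterior mean of the cleaner signal
    mean = (x_t - (1 - a_t) / np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(a_t)
    # Supportive step: mix the noisy observation back in at every timestep
    x_prev = (1 - gamma) * mean + gamma * y
    if t > 0:  # add sampling noise except at the final step
        x_prev += np.sqrt(1 - a_t) * np.random.randn(*x_t.shape)
    return x_prev
```

Under this sketch, fast sampling simply means calling the step for a short subsequence of timesteps; the noisy input `y` keeps each intermediate estimate anchored to the observation.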
Related papers
- Diffusion Conditional Expectation Model for Efficient and Robust Target
Speech Extraction [73.43534824551236]
We propose an efficient generative approach named the Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
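The posterior sampling idea above combines a learned clean-speech prior with a noise model. A minimal sketch, assuming a Gaussian observation model (the paper's actual noise model may differ; `prior_score` and `noise_std` are hypothetical names):

```python
import numpy as np

def posterior_score(x, y, t, prior_score, noise_std=0.1):
    """Posterior score for unsupervised enhancement: the learned
    clean-speech prior score plus the gradient of a Gaussian
    likelihood log p(y | x). Gaussian noise is an assumption here."""
    likelihood_grad = (y - x) / noise_std**2  # grad_x log N(y; x, noise_std^2 I)
    return prior_score(x, t) + likelihood_grad
```

Sampling then follows the usual score-based reverse dynamics, but driven by this posterior score instead of the prior score alone.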
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
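The patch-based processing mentioned above amounts to splitting the input signal into fixed-length segments before the model sees them. A minimal sketch (the patch length and padding scheme are illustrative assumptions, not LinDiff's settings):

```python
import numpy as np

def patchify(wave, patch_len):
    """Split a 1-D waveform into fixed-length patches, zero-padding the
    tail so the length divides evenly. Used to cut per-step compute by
    letting the model operate on short segments in parallel."""
    pad = (-len(wave)) % patch_len  # padding needed to reach a multiple
    padded = np.concatenate([wave, np.zeros(pad)])
    return padded.reshape(-1, patch_len)
```

The inverse operation is a reshape back to 1-D followed by trimming the padding, so patching adds no information loss by itself.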
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative
Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
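Predicting clean data directly, as ProDiff does, changes what the network outputs but the reverse step can still use the standard DDPM posterior. A hedged sketch, with the network's clean-data estimate substituted for the true x0 (schedule details are assumptions):

```python
import numpy as np

def x0_reverse_step(x_t, t, model, alphas, alpha_bars):
    """Reverse step for a denoiser that predicts the clean signal x0
    directly (ProDiff-style parameterization) rather than the added
    noise. Uses the standard DDPM posterior mean as an illustration."""
    x0_hat = model(x_t, t)  # direct clean-data prediction
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_t = 1 - a_t
    # DDPM posterior q(x_{t-1} | x_t, x0) mean, with x0 replaced by x0_hat
    mean = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0_hat \
         + (np.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)) * x_t
    if t > 0:  # sampling noise except at the final step
        mean += np.sqrt(beta_t) * np.random.randn(*x_t.shape)
    return mean
```

Because the network's output is already a clean-data estimate, a few large reverse steps can still land near the data manifold, which is what makes two-iteration synthesis plausible.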
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
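Incorporating the noisy observation into the diffusion process can be sketched as a forward step whose mean interpolates between clean and noisy speech. The interpolation weights `m[t]` and the variance choice below are illustrative assumptions, not the paper's exact schedule:

```python
import numpy as np

def conditional_forward_sample(x0, y, t, alpha_bars, m):
    """Conditional forward diffusion: the mean drifts from clean speech
    x0 toward noisy speech y as t grows, so the reverse process starts
    from (a noised version of) the observation rather than pure noise."""
    ab = alpha_bars[t]
    mean = (1 - m[t]) * np.sqrt(ab) * x0 + m[t] * np.sqrt(ab) * y
    var = 1 - ab  # simple variance schedule, chosen for illustration
    return mean + np.sqrt(var) * np.random.randn(*x0.shape)
```

With `m[t] = 0` this collapses to the standard unconditional forward process, which is one way to see the conditional model as a strict generalization.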
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.