Real Time Speech Enhancement in the Waveform Domain
- URL: http://arxiv.org/abs/2006.12847v3
- Date: Sun, 6 Sep 2020 14:32:59 GMT
- Title: Real Time Speech Enhancement in the Waveform Domain
- Authors: Alexandre Defossez, Gabriel Synnaeve, Yossi Adi
- Abstract summary: We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
- Score: 99.02180506016721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a causal speech enhancement model working on the raw waveform that
runs in real-time on a laptop CPU. The proposed model is based on an
encoder-decoder architecture with skip-connections. It is optimized on both
time and frequency domains, using multiple loss functions. Empirical evidence
shows that it is capable of removing various kinds of background noise
including stationary and non-stationary noises, as well as room reverb.
Additionally, we suggest a set of data augmentation techniques applied directly
on the raw waveform which further improve model performance and its
generalization abilities. We perform evaluations on several standard
benchmarks, both using objective metrics and human judgements. The proposed
model matches state-of-the-art performance of both causal and non-causal
methods while working directly on the raw waveform.
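The abstract describes two concrete ideas: a loss computed on both the time and frequency domains, and data augmentation applied directly to the raw waveform. The following is a minimal NumPy sketch of both, not the authors' implementation; the STFT parameters, the weighting factor alpha, and the random_shift helper are illustrative assumptions.

```python
import numpy as np

def stft_mag(x, n_fft=64, hop=16):
    """Magnitude spectrogram via windowed, framed FFT (illustrative sizes)."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    window = np.hanning(n_fft)
    return np.abs(np.fft.rfft(np.stack(frames) * window, axis=-1))

def enhancement_loss(estimate, clean, alpha=0.5):
    """Combine a time-domain L1 term with a spectral-magnitude L1 term,
    in the spirit of the paper's multi-domain objective."""
    time_loss = np.mean(np.abs(estimate - clean))
    spec_loss = np.mean(np.abs(stft_mag(estimate) - stft_mag(clean)))
    return time_loss + alpha * spec_loss

def random_shift(x, max_shift=16, rng=None):
    """Hypothetical raw-waveform augmentation: a small random circular
    shift applied to the signal before training."""
    rng = rng or np.random.default_rng()
    return np.roll(x, rng.integers(0, max_shift))
```

The loss is zero when the estimate equals the clean signal and grows with both waveform and spectral mismatch; in practice such spectral terms are often evaluated at several FFT resolutions.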
Related papers
- Audio Decoding by Inverse Problem Solving [1.0612107014404766]
We consider audio decoding as an inverse problem and solve it through diffusion posterior sampling.
Explicit conditioning functions are developed for signal measurements provided by an example of a transform-domain perceptual audio codec.
arXiv Detail & Related papers (2024-09-12T09:05:18Z)
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the upper bound on source separation performance can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- Speech Denoising in the Waveform Domain with Self-Attention [27.84933221217885]
We present CleanUNet, a causal speech denoising model on the raw waveform.
The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations.
arXiv Detail & Related papers (2022-02-15T23:44:02Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- Restoring degraded speech via a modified diffusion model [28.90259510094427]
We introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal.
We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech.
Our model yields improved speech quality over the original DiffWave baseline in several different experiments.
arXiv Detail & Related papers (2021-04-22T23:03:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.