Restoring degraded speech via a modified diffusion model
- URL: http://arxiv.org/abs/2104.11347v1
- Date: Thu, 22 Apr 2021 23:03:23 GMT
- Title: Restoring degraded speech via a modified diffusion model
- Authors: Jianwei Zhang, Suren Jayasuriya, Visar Berisha
- Abstract summary: We introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal.
We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech.
Our model yields improved speech quality (with the original DiffWave model as a baseline) in several different experiments.
- Score: 28.90259510094427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are many deterministic mathematical operations (e.g. compression,
clipping, downsampling) that degrade speech quality considerably. In this paper
we introduce a neural network architecture, based on a modification of the
DiffWave model, that aims to restore the original speech signal. DiffWave, a
recently published diffusion-based vocoder, has shown state-of-the-art
synthesized speech quality and relatively shorter waveform generation times,
with only a small set of parameters. We replace the mel-spectrum upsampler in
DiffWave with a deep CNN upsampler, which is trained to alter the degraded
speech mel-spectrum to match that of the original speech. The model is trained
using the original speech waveform, but conditioned on the degraded speech
mel-spectrum. Post-training, only the degraded mel-spectrum is used as input
and the model generates an estimate of the original speech. Our model results
in improved speech quality (original DiffWave model as baseline) on several
different experiments. These include improving the quality of speech degraded
by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the
original DiffWave architecture, our scheme achieves better performance on
several objective perceptual metrics and in subjective comparisons.
Improvements over the baseline are further amplified in an out-of-corpus evaluation
setting.
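To make the architecture concrete, here is a minimal PyTorch sketch of the core idea: a deep CNN that nudges the degraded mel-spectrum toward the clean one, followed by DiffWave-style transposed-convolution upsampling to waveform resolution. The layer counts, channel widths, and residual connection are illustrative assumptions rather than the authors' exact configuration; the stride-16 transposed convolutions assume a hop length of 256, as in DiffWave.

```python
# Sketch only: swap DiffWave's shallow mel upsampler for a deeper CNN
# that first "repairs" the degraded mel-spectrum, then upsamples it to
# waveform resolution as the diffusion conditioner.
import torch
import torch.nn as nn

class DeepCNNUpsampler(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 64):
        super().__init__()
        # Deep CNN that maps the degraded mel-spectrum toward the clean
        # one (treats the spectrogram as a 1-channel mels x frames image).
        self.repair = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1), nn.LeakyReLU(0.4),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.LeakyReLU(0.4),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )
        # DiffWave-style time upsampling: two transposed convs with
        # stride 16 each take 1 frame -> 256 samples (hop length 256).
        self.up = nn.Sequential(
            nn.ConvTranspose2d(1, 1, (3, 32), stride=(1, 16), padding=(1, 8)),
            nn.LeakyReLU(0.4),
            nn.ConvTranspose2d(1, 1, (3, 32), stride=(1, 16), padding=(1, 8)),
            nn.LeakyReLU(0.4),
        )

    def forward(self, degraded_mel: torch.Tensor) -> torch.Tensor:
        # degraded_mel: (batch, n_mels, frames)
        x = degraded_mel.unsqueeze(1)   # (B, 1, mels, frames)
        x = x + self.repair(x)          # residual mel "repair"
        x = self.up(x)                  # (B, 1, mels, frames * 256)
        return x.squeeze(1)             # conditioner fed to DiffWave

cond = DeepCNNUpsampler()(torch.randn(2, 80, 50))
print(cond.shape)  # torch.Size([2, 80, 12800])
```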
Related papers
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
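A minimal sketch of this overlapping-frame scheme follows; the frame length, overlap, and the `diffusion_sample` placeholder are assumptions, and a real model would run a full reverse-diffusion loop per frame rather than the stub shown here.

```python
# Illustrative DiffAR-style autoregressive sampling: each new frame is
# produced by a (placeholder) diffusion sampler conditioned on the tail
# of the previously generated frame, then cross-faded into the output.
import numpy as np

FRAME, OVERLAP = 4000, 1000  # samples per frame and overlap (assumed)

def diffusion_sample(context: np.ndarray) -> np.ndarray:
    """Placeholder: a real model runs reverse diffusion conditioned on
    `context`, the overlapping tail of the previous frame."""
    frame = np.random.default_rng().standard_normal(FRAME) * 0.1
    frame[:OVERLAP] = context  # keep continuity with the context
    return frame

def generate(n_frames: int) -> np.ndarray:
    out = diffusion_sample(np.zeros(OVERLAP))  # first frame: silent context
    for _ in range(n_frames - 1):
        frame = diffusion_sample(out[-OVERLAP:])   # condition on the tail
        fade = np.linspace(0.0, 1.0, OVERLAP)      # linear cross-fade
        out[-OVERLAP:] = (1 - fade) * out[-OVERLAP:] + fade * frame[:OVERLAP]
        out = np.concatenate([out, frame[OVERLAP:]])
    return out

print(generate(5).shape)  # (16000,) = FRAME + 4 * (FRAME - OVERLAP)
```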
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
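A rough sketch of the band-splitting idea: the signal is divided into frequency bands, each band is refined independently (one diffusion model per band in the paper; an identity placeholder here), and the bands are summed back together. The sample rate, crossover frequencies, and `band_denoiser` are assumptions.

```python
# Multi-band processing skeleton: split, refine per band, sum.
import numpy as np
from scipy.signal import butter, sosfilt

SR = 24000  # assumed sample rate

def analysis_filters():
    cuts = [4000, 8000]  # assumed crossover frequencies (Hz)
    return [butter(4, cuts[0], btype="lowpass", fs=SR, output="sos"),
            butter(4, cuts, btype="bandpass", fs=SR, output="sos"),
            butter(4, cuts[1], btype="highpass", fs=SR, output="sos")]

def band_denoiser(band, tokens, band_idx):
    """Placeholder: a per-band diffusion model would refine this band,
    conditioned on the low-bitrate discrete-token representation."""
    return band

def multi_band_generate(coarse, tokens):
    out = np.zeros_like(coarse)
    for i, sos in enumerate(analysis_filters()):
        band = sosfilt(sos, coarse)            # isolate one frequency band
        out += band_denoiser(band, tokens, i)  # refine it independently
    return out

print(multi_band_generate(np.random.randn(SR), tokens=None).shape)  # (24000,)
```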
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
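A rough sketch of this two-stage design, assuming frame-aligned audio-visual features: a transformer predicts clean mel-spectrograms (stage 1) and a stand-in vocoder maps them to a waveform (stage 2). All dimensions and module internals are illustrative assumptions, not LA-VocE's actual configuration.

```python
# Two-stage pipeline skeleton: transformer mel predictor + vocoder.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):  # stage 1
    def __init__(self, d_audio=80, d_video=512, d_model=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(d_audio + d_video, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mel, video_feats):
        # noisy_mel: (B, T, 80); video_feats: (B, T, 512), frame-aligned
        x = self.proj(torch.cat([noisy_mel, video_feats], dim=-1))
        return self.head(self.encoder(x))  # predicted clean mel (B, T, 80)

class Vocoder(nn.Module):  # stage 2: crude stand-in for a neural vocoder
    def __init__(self, n_mels=80, hop=160):
        super().__init__()
        self.net = nn.Conv1d(n_mels, hop, kernel_size=1)  # 1 frame -> hop samples

    def forward(self, mel):                 # mel: (B, T, 80)
        y = self.net(mel.transpose(1, 2))   # (B, hop, T)
        return y.transpose(1, 2).reshape(mel.size(0), -1)  # (B, T * hop)

mel = MelPredictor()(torch.randn(1, 100, 80), torch.randn(1, 100, 512))
wav = Vocoder()(mel)
print(mel.shape, wav.shape)  # torch.Size([1, 100, 80]) torch.Size([1, 16000])
```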
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations.
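The sketch below mimics that refinement loop with a DDPM-style ancestral update; `eps_model` and the six-step noise schedule are hypothetical stand-ins echoing the paper's claim, not WaveGrad's trained network or published schedule.

```python
# Iterative refinement from Gaussian noise, conditioned on a mel-spectrogram.
import numpy as np

rng = np.random.default_rng(0)
betas = np.array([1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1])  # assumed 6-step schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(y, mel, noise_level):
    """Placeholder for the trained conditional noise estimator."""
    return 0.1 * y  # a real model predicts the injected noise

def sample(mel, n_samples=16000):
    y = rng.standard_normal(n_samples)       # start from white noise, y_N ~ N(0, I)
    for n in reversed(range(len(betas))):    # refine from step N down to 1
        eps = eps_model(y, mel, np.sqrt(alpha_bar[n]))
        y = (y - (1 - alphas[n]) / np.sqrt(1 - alpha_bar[n]) * eps)
        y = y / np.sqrt(alphas[n])
        if n > 0:                            # re-inject noise except at the last step
            y = y + np.sqrt(betas[n]) * rng.standard_normal(n_samples)
    return y

print(sample(mel=None).shape)  # (16000,)
```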
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
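A minimal sketch of such a causal encoder-decoder with skip connections: causality is enforced by left-padding every convolution so no future samples are used. The depth, kernel size, and channel counts are assumptions, not the paper's exact hyperparameters.

```python
# Causal waveform encoder-decoder with U-Net-style skip connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalEncDec(nn.Module):
    def __init__(self, ch=32, kernel=8, stride=4, depth=3):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        chs = [1] + [ch * 2**i for i in range(depth)]
        self.enc = nn.ModuleList(
            nn.Conv1d(chs[i], chs[i + 1], kernel, stride) for i in range(depth))
        self.dec = nn.ModuleList(
            nn.ConvTranspose1d(chs[i + 1], chs[i], kernel, stride)
            for i in reversed(range(depth)))

    def forward(self, x):                        # x: (B, 1, T)
        lengths, skips = [], []
        for conv in self.enc:
            lengths.append(x.shape[-1])
            x = F.pad(x, (self.kernel - self.stride, 0))  # causal left-pad
            x = torch.relu(conv(x))
            skips.append(x)
        for deconv in self.dec:
            x = x + skips.pop()                  # skip connection from encoder
            x = deconv(x)[..., : lengths.pop()]  # trim back to encoder-side length
        return x

net = CausalEncDec()
print(net(torch.randn(1, 1, 16000)).shape)  # torch.Size([1, 1, 16000])
```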
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)