Speech Enhancement with Score-Based Generative Models in the Complex
STFT Domain
- URL: http://arxiv.org/abs/2203.17004v1
- Date: Thu, 31 Mar 2022 12:53:47 GMT
- Title: Speech Enhancement with Score-Based Generative Models in the Complex
STFT Domain
- Authors: Simon Welker, Julius Richter, Timo Gerkmann
- Abstract summary: We propose a novel training task for speech enhancement using a complex-valued deep neural network.
We derive this training task within the formalism of stochastic differential equations, thereby enabling the use of predictor-corrector samplers.
- Score: 18.090665052145653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Score-based generative models (SGMs) have recently shown impressive results
for difficult generative tasks such as the unconditional and conditional
generation of natural images and audio signals. In this work, we extend these
models to the complex short-time Fourier transform (STFT) domain, proposing a
novel training task for speech enhancement using a complex-valued deep neural
network. We derive this training task within the formalism of stochastic
differential equations, thereby enabling the use of predictor-corrector
samplers. We provide alternative formulations inspired by previous publications
on using SGMs for speech enhancement, avoiding the need for any prior
assumptions on the noise distribution and making the training task purely
generative, which, as we show, results in improved enhancement performance.
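The predictor-corrector sampling mentioned in the abstract can be illustrated with a generic annealed-Langevin sketch. Note this is a simplified illustration, not the paper's method: the trained complex-valued score network and the paper's conditional SDE are replaced here by a hypothetical `toy_score` function (the analytic score of a Gaussian), and the noise schedule `sigmas` is an arbitrary choice. It shows only the general shape of a variance-exploding predictor step followed by Langevin corrector steps on a complex-valued state, mimicking an STFT representation.

```python
import numpy as np

def predictor_corrector_sample(score_fn, shape, sigmas, n_corrector=1,
                               snr=0.16, seed=0):
    """Generic predictor-corrector sampler sketch for a VE SDE.

    score_fn(x, sigma) stands in for a trained score network; here it is
    a toy analytic score, not the paper's complex-valued DNN.
    """
    rng = np.random.default_rng(seed)
    # Complex-valued state (real + imaginary parts), as in an STFT grid.
    x = sigmas[0] * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        # Predictor: discretized reverse-diffusion (Euler-Maruyama) step.
        g2 = s_cur**2 - s_next**2  # diffusion increment for this step
        z = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
        x = x + g2 * score_fn(x, s_cur) + np.sqrt(g2) * z
        # Corrector: Langevin MCMC steps at the new noise level.
        for _ in range(n_corrector):
            grad = score_fn(x, s_next)
            z = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
            eps = 2 * (snr * s_next) ** 2  # step size tied to noise level
            x = x + eps * grad + np.sqrt(2 * eps) * z
    return x

# Toy stand-in score: exact score of a zero-mean Gaussian perturbed to
# level sigma, i.e. -x / (1 + sigma^2). A real model would be learned.
toy_score = lambda x, sigma: -x / (1.0 + sigma**2)

sigmas = np.geomspace(10.0, 0.01, 30)  # geometric noise annealing schedule
sample = predictor_corrector_sample(toy_score, (4, 8), sigmas)
print(sample.shape, np.iscomplexobj(sample))
```

In the paper's setting, `score_fn` would additionally be conditioned on the noisy input spectrogram, so the sampler draws from the posterior over clean speech rather than the unconditional prior sketched here.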
Related papers
- DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval [49.076590578101985]
We present a diffusion-based ATR framework (DiffATR) that generates the joint distribution from noise.
Experiments on the AudioCaps and Clotho datasets show superior performance, verifying the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-16T06:33:26Z) - Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z) - Controlled Randomness Improves the Performance of Transformer Models [4.678970068275123]
We introduce controlled randomness, i.e. noise, into the training process to improve the fine-tuning of language models.
We find that adding such noise can improve performance on our two downstream tasks: joint named entity recognition and relation extraction, and text summarization.
arXiv Detail & Related papers (2023-10-20T14:12:55Z) - A weighted-variance variational autoencoder model for speech enhancement [0.0]
We propose a weighted variance generative model, where the contribution of each spectrogram time-frame in parameter learning is weighted.
We develop efficient training and speech enhancement algorithms based on the proposed generative model.
arXiv Detail & Related papers (2022-11-02T09:51:15Z) - Period VITS: Variational Inference with Explicit Pitch Modeling for
End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator.
In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text.
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
arXiv Detail & Related papers (2022-10-28T07:52:30Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement method (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Self-supervised Pre-training with Hard Examples Improves Visual
Representations [110.23337264762512]
Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning.
We first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels.
Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations.
arXiv Detail & Related papers (2020-12-25T02:44:22Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z) - Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.