Universal Speech Enhancement with Score-based Diffusion
- URL: http://arxiv.org/abs/2206.03065v1
- Date: Tue, 7 Jun 2022 07:32:32 GMT
- Title: Universal Speech Enhancement with Score-based Diffusion
- Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini
- Abstract summary: We present a universal speech enhancement system that tackles 55 different distortions at the same time.
Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network.
We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners.
- Score: 21.294665965300922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Removing background noise from speech audio has been the subject of
considerable research and effort, especially in recent years due to the rise of
virtual communication and amateur sound recording. Yet background noise is not
the only unpleasant disturbance that can prevent intelligibility: reverb,
clipping, codec artifacts, problematic equalization, limited bandwidth, or
inconsistent loudness are equally disturbing and ubiquitous. In this work, we
propose to consider the task of speech enhancement as a holistic endeavor, and
present a universal speech enhancement system that tackles 55 different
distortions at the same time. Our approach consists of a generative model that
employs score-based diffusion, together with a multi-resolution conditioning
network that performs enhancement with mixture density networks. We show that
this approach significantly outperforms the state of the art in a subjective
test performed by expert listeners. We also show that it achieves competitive
objective scores with just 4-8 diffusion steps, despite not considering any
particular strategy for fast sampling. We hope that both our methodology and
technical contributions encourage researchers and practitioners to adopt a
universal approach to speech enhancement, possibly framing it as a generative
task.
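The abstract's sampling claim (competitive objective scores with only 4-8 diffusion steps) can be illustrated with a toy score-based sampler. The sketch below is purely illustrative: `score_fn` is a hypothetical stand-in for the paper's learned conditioning network, using the closed-form score of a Gaussian around a known clean signal, so the loop demonstrates the annealed reverse-diffusion mechanics rather than the actual model.

```python
import numpy as np

def score_fn(x, target, sigma):
    # Closed-form score of N(x; target, sigma^2 I): grad_x log p = (target - x) / sigma^2.
    # Hypothetical stand-in for a learned score/conditioning network.
    return (target - x) / (sigma ** 2)

def reverse_diffusion(target, n_steps=8, sigma_max=1.0, sigma_min=0.01, seed=0):
    """Annealed Langevin-style reverse diffusion with a geometric noise schedule."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    x = rng.normal(0.0, sigma_max, size=target.shape)  # start from pure noise
    for sigma in sigmas:
        step = 0.5 * sigma ** 2                        # step size shrinks with noise level
        x = x + step * score_fn(x, target, sigma)      # drift toward high density
        x = x + np.sqrt(2.0 * step) * rng.normal(size=x.shape)  # Langevin noise injection
    return x

clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))  # synthetic stand-in for a clean signal
estimate = reverse_diffusion(clean, n_steps=8)
```

With the analytic score, eight annealing steps already drive the sample close to the clean target, mirroring the few-step regime the paper reports; a real system replaces `score_fn` with a network conditioned on the distorted input.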
Related papers
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Speech Emotion Recognition (SER) is subject to ubiquitous environmental noise.
We introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge.
We show that TRNet substantially increases the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction [73.43534824551236]
We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)
- Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration [16.09633286837904]
We systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks.
We observe that the generative approach performs globally better than its discriminative counterpart on all tasks.
arXiv Detail & Related papers (2022-11-04T12:06:14Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
Improving the robustness of speech processing models to such distortions is therefore essential for maintaining good performance.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks [29.821666380496637]
HiFi-GAN transforms recorded speech to sound as though it had been recorded in a studio.
It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
arXiv Detail & Related papers (2020-06-10T07:24:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.