Universal Speech Enhancement with Score-based Diffusion
- URL: http://arxiv.org/abs/2206.03065v1
- Date: Tue, 7 Jun 2022 07:32:32 GMT
- Title: Universal Speech Enhancement with Score-based Diffusion
- Authors: Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini
- Abstract summary: We present a universal speech enhancement system that tackles 55 different distortions at the same time.
Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network.
We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners.
- Score: 21.294665965300922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Removing background noise from speech audio has been the subject of
considerable research and effort, especially in recent years due to the rise of
virtual communication and amateur sound recording. Yet background noise is not
the only unpleasant disturbance that can prevent intelligibility: reverb,
clipping, codec artifacts, problematic equalization, limited bandwidth, or
inconsistent loudness are equally disturbing and ubiquitous. In this work, we
propose to consider the task of speech enhancement as a holistic endeavor, and
present a universal speech enhancement system that tackles 55 different
distortions at the same time. Our approach consists of a generative model that
employs score-based diffusion, together with a multi-resolution conditioning
network that performs enhancement with mixture density networks. We show that
this approach significantly outperforms the state of the art in a subjective
test performed by expert listeners. We also show that it achieves competitive
objective scores with just 4-8 diffusion steps, despite not considering any
particular strategy for fast sampling. We hope that both our methodology and
technical contributions encourage researchers and practitioners to adopt a
universal approach to speech enhancement, possibly framing it as a generative
task.
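To make the notion of "diffusion steps" in the abstract concrete, the following is a minimal toy sketch of score-based sampling with a small, fixed number of steps. It is not the paper's model: the abstract does not specify the sampler, so the variance-exploding noise schedule, the Euler-Maruyama reverse update, the one-dimensional Gaussian "data" with its closed-form score (standing in for a learned, conditioned score network), and every name and parameter value below are assumptions made purely for illustration.

```python
import math
import random


def geometric_sigmas(sigma_max, sigma_min, n_steps):
    """Return n_steps + 1 noise levels, geometrically spaced from sigma_max down to sigma_min."""
    ratio = (sigma_min / sigma_max) ** (1.0 / n_steps)
    return [sigma_max * ratio ** i for i in range(n_steps + 1)]


def gaussian_score(x, sigma, mu, s0):
    """Exact score of N(mu, s0^2) data perturbed by N(0, sigma^2) noise.

    Assumption: a real system would replace this with a trained network.
    """
    return -(x - mu) / (s0 ** 2 + sigma ** 2)


def sample(n_steps, rng, mu=2.0, s0=0.5, sigma_max=10.0, sigma_min=0.01):
    """Draw one sample with n_steps Euler-Maruyama updates of the reverse process."""
    sigmas = geometric_sigmas(sigma_max, sigma_min, n_steps)
    x = rng.gauss(0.0, sigma_max)  # start from (approximately) pure noise
    for hi, lo in zip(sigmas[:-1], sigmas[1:]):
        step = hi ** 2 - lo ** 2  # variance removed when moving from hi to lo
        noise = rng.gauss(0.0, 1.0)
        x = x + step * gaussian_score(x, hi, mu, s0) + math.sqrt(step) * noise
    return x


rng = random.Random(0)
samples = [sample(8, rng) for _ in range(4000)]
mean = sum(samples) / len(samples)
print(f"sample mean after 8 steps: {mean:.3f} (toy data mean is 2.0)")
```

In this toy setting the sample mean lands close to the data mean even with 8 steps, while the spread of the samples is only approximate because each update is a coarse discretization. How few steps a real enhancement model tolerates is an empirical question that this sketch does not answer.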
Related papers
- FINALLY: fast and universal speech enhancement with studio-like quality [7.207284147264852]
We address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion.
We study various feature extractors for perceptual loss to facilitate the stability of adversarial training.
We integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model.
arXiv Detail & Related papers (2024-10-08T11:16:03Z)
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration [16.09633286837904]
We systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks.
We observe that the generative approach performs globally better than its discriminative counterpart on all tasks.
arXiv Detail & Related papers (2022-11-04T12:06:14Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.
It is high time that we enhance the robustness of speech processing models to obtain good performance when encountering speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks [29.821666380496637]
HiFi-GAN transforms recorded speech to sound as though it had been recorded in a studio.
It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
arXiv Detail & Related papers (2020-06-10T07:24:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.