Training Speech Enhancement Systems with Noisy Speech Datasets
- URL: http://arxiv.org/abs/2105.12315v1
- Date: Wed, 26 May 2021 03:32:39 GMT
- Title: Training Speech Enhancement Systems with Noisy Speech Datasets
- Authors: Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji
- Abstract summary: We propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions, which make them robust against noisy speech targets.
We show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way.
- Score: 7.157870452667369
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recently, deep neural network (DNN)-based speech enhancement (SE) systems
have been used with great success. During training, such systems require clean
speech data - ideally, in large quantity with a variety of acoustic conditions,
many different speaker characteristics and for a given sampling rate (e.g.,
48kHz for fullband SE). However, obtaining such clean speech data is not
straightforward - especially, if only considering publicly available datasets.
At the same time, a lot of material for automatic speech recognition (ASR) with
the desired acoustic/speaker/sampling rate characteristics is publicly
available, except that it is not clean, i.e., it also contains background
noise, which is often even desirable in order to make ASR systems noise-robust.
Hence, using such data to train SE systems is not straightforward. In this
paper, we propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions that make them
robust against noisy speech targets. In particular, computing the median over
the sample axis before averaging over time-frequency bins makes it possible to
use such data. Furthermore, we propose a noise augmentation scheme for
mixture-invariant training (MixIT), which makes MixIT applicable in such
scenarios as well. For our
experiments, we use the Mozilla Common Voice dataset and we show that using our
robust loss function improves PESQ by up to 0.19 compared to a system trained
in the traditional way. Similarly, for MixIT we can see an improvement of up to
0.27 in PESQ when using our proposed noise augmentation.
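The median-over-samples idea from the abstract can be sketched as follows. This is a hedged reconstruction, not the authors' code: the function name `robust_sample_median_loss`, the use of magnitude spectrograms, and the absolute-error distance are illustrative assumptions; only the reduction order (median over the sample/batch axis first, then mean over time-frequency bins) comes from the abstract.

```python
import numpy as np

def robust_sample_median_loss(estimates, targets):
    """Loss that is robust to noisy targets (sketch based on the abstract).

    estimates, targets: arrays of shape (batch, freq, time) holding
    magnitude spectrograms. Per-bin errors are first reduced with a
    median over the batch (sample) axis, so outlier samples whose
    targets are very noisy cannot dominate the loss; the result is
    then averaged over the time-frequency bins.
    """
    per_bin_error = np.abs(estimates - targets)             # (batch, freq, time)
    median_over_samples = np.median(per_bin_error, axis=0)  # (freq, time)
    return float(np.mean(median_over_samples))
```

With a majority of clean targets in the batch, a single heavily corrupted sample leaves the median, and hence the loss, essentially unchanged, which is the claimed robustness property.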
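For context, the mixture-invariant training objective that the proposed noise augmentation builds on can be sketched as follows. This is a minimal sketch of the general MixIT idea, not the authors' implementation; the function name `mixit_loss`, the brute-force search over binary assignments, and the L2 reconstruction error are illustrative assumptions.

```python
import itertools
import numpy as np

def mixit_loss(sources, mix1, mix2):
    """Mixture-invariant training (MixIT) loss, minimal sketch.

    sources: (M, T) separated outputs produced for the
    mixture-of-mixtures mix1 + mix2. Each output is assigned to either
    mix1 or mix2, and the loss is the best (minimum) L2 reconstruction
    error over all 2**M binary assignments, so no clean references are
    needed, only the two mixtures.
    """
    m = sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=m):
        est1 = sum(s for s, a in zip(sources, assignment) if a == 0)
        est2 = sum(s for s, a in zip(sources, assignment) if a == 1)
        error = np.sum((est1 - mix1) ** 2) + np.sum((est2 - mix2) ** 2)
        best = min(best, error)
    return float(best)
```

The paper's contribution on top of this objective is a noise augmentation scheme for the mixtures; the exact scheme is not described in the abstract, so it is not reproduced here.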
Related papers
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches for Automatic Speech Recognition Systems [0.0]
This work presents the process of building a dataset of noisy audio, specifically audio degraded by interference.
We also present initial results of a classifier that uses such data for evaluation, indicating the benefits of using this dataset in the recognizer's training process.
arXiv Detail & Related papers (2021-10-04T13:08:53Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who cannot perceive events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant to people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.