Training Speech Enhancement Systems with Noisy Speech Datasets
- URL: http://arxiv.org/abs/2105.12315v1
- Date: Wed, 26 May 2021 03:32:39 GMT
- Title: Training Speech Enhancement Systems with Noisy Speech Datasets
- Authors: Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji
- Abstract summary: We propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions, which make them robust against noisy speech targets.
We show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way.
- Score: 7.157870452667369
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recently, deep neural network (DNN)-based speech enhancement (SE) systems
have been used with great success. During training, such systems require clean
speech data - ideally, in large quantity with a variety of acoustic conditions,
many different speaker characteristics and for a given sampling rate (e.g.,
48 kHz for fullband SE). However, obtaining such clean speech data is not
straightforward, especially if only publicly available datasets are considered.
At the same time, a lot of material for automatic speech recognition (ASR) with
the desired acoustic, speaker, and sampling-rate characteristics is publicly
available, except that it is not clean, i.e., it also contains background noise,
which is often even desirable for training noise-robust ASR systems.
Hence, using such data to train SE systems is not straightforward. In this
paper, we propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions, which make them
robust against noisy speech targets. In particular, computing the median over
the sample axis before averaging over time-frequency bins allows such data to
be used. Furthermore, we propose a noise augmentation scheme for mixture-invariant
training (MixIT), which allows it to be used in such scenarios as well. For our
experiments, we use the Mozilla Common Voice dataset and we show that using our
robust loss function improves PESQ by up to 0.19 compared to a system trained
in the traditional way. Similarly, for MixIT we can see an improvement of up to
0.27 in PESQ when using our proposed noise augmentation.
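To make the robust-loss idea above concrete, the following is a minimal PyTorch sketch of a "sample-median" loss; it is a hypothetical reading of the abstract, not the authors' implementation, and the spectrogram shapes and the squared-error base loss are assumptions:
```python
# Hypothetical sketch, not the authors' code: a "sample-median" robust SE loss
# as described in the abstract. The per-bin squared error is aggregated with a
# median over the sample (batch) axis before averaging over time-frequency
# bins, so that a few noisy targets in a batch do not dominate the loss.
import torch

def robust_median_loss(estimate: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """estimate, target: (magnitude) spectrograms of shape (batch, freq, time)."""
    per_bin_error = (estimate - target).abs() ** 2            # squared error per TF bin
    median_over_samples = per_bin_error.median(dim=0).values  # median over the sample axis
    return median_over_samples.mean()                         # average over time-frequency bins

if __name__ == "__main__":
    est = torch.rand(8, 257, 100)   # batch of 8 estimated spectrograms
    tgt = torch.rand(8, 257, 100)   # batch of 8 (possibly noisy) target spectrograms
    print(robust_median_loss(est, tgt).item())
```
The rationale is that a median over the sample axis discards per-bin outliers caused by residual noise in the target speech, whereas a plain mean over the batch would let such corrupted bins dominate the gradient.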
Related papers
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features.
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
- Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech [0.0]
The presence of pitch and pitch harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform.
We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features.
The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset.
arXiv Detail & Related papers (2024-08-25T07:17:11Z)
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who cannot perceive events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.