Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches
for Automatic Speech Recognition Systems
- URL: http://arxiv.org/abs/2110.01425v1
- Date: Mon, 4 Oct 2021 13:08:53 GMT
- Title: Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches
for Automatic Speech Recognition Systems
- Authors: Julio Cesar Duarte and Sérgio Colcher
- Abstract summary: This work presents the process of building a dataset of noisy audio, specifically audio degraded by interference.
We also present initial results of a classifier that uses such data for evaluation, indicating the benefits of using this dataset in the recognizer's training process.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic speech recognition systems are part of people's daily lives,
embedded in personal assistants and mobile phones, facilitating human-machine
interaction and allowing access to information in a practically intuitive way.
Such systems are usually implemented using machine learning techniques,
especially deep neural networks. Despite their high performance at transcribing
speech to text, few works address recognition in noisy environments, and the
datasets used typically contain no noisy audio examples, mitigating the issue
only through data augmentation techniques. This work presents the process of
building a dataset of noisy audio, specifically audio degraded by the kind of
interference commonly present in radio transmissions. Additionally, we present
initial results of a classifier that uses such data for evaluation, indicating
the benefits of using this dataset in the recognizer's training process. This
recognizer achieves an average character error rate (CER) of 0.4116 on the
noisy set (SNR = 30).
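As a hedged illustration of the evaluation setup described above (not the authors' code), the sketch below shows one standard way to mix noise into clean speech at a target SNR and to compute the character error rate behind the 0.4116 figure; `mix_at_snr` and `cer` are hypothetical helper names.
```python
# Minimal sketch (not from the paper): mixing noise at a target SNR and
# computing character error rate (CER). Helper names are hypothetical.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def cer(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[m, n] / max(m, 1)
```
Under this definition, the reported CER of 0.4116 means roughly 41% of the reference characters would need to be edited to match the hypothesis.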
Related papers
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguished using self-supervised (WavLM) features.
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
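A minimal sketch of the ASR-based filtering idea summarized in the entry above: a generated clip is kept only when a recognizer transcribes it back to the intended command. The `synthesize` and `transcribe` callables are hypothetical stand-ins for any TTS and ASR system, not the paper's actual components.
```python
# Hedged sketch of ASR-based filtering for synthetic speech commands.
# `synthesize` and `transcribe` are hypothetical TTS/ASR stand-ins.
def filter_synthetic_commands(commands, synthesize, transcribe):
    kept = []
    for text in commands:
        audio = synthesize(text)            # generate a synthetic clip
        hypothesis = transcribe(audio)      # ASR round-trip on the clip
        if hypothesis.strip().lower() == text.strip().lower():
            kept.append((text, audio))      # keep only intelligible clips
    return kept
```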
- Some voices are too common: Building fair speech recognition systems using the Common Voice dataset [2.28438857884398]
We use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec2.0 model toward several demographic groups.
We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account.
arXiv Detail & Related papers (2023-06-01T11:42:34Z)
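The kind of bias quantification the entry above describes reduces to computing an error rate per demographic group; a minimal sketch follows, where `transcribe`, the `metric`, and the record format are assumptions rather than the paper's actual pipeline.
```python
# Hedged sketch: score a recognizer per demographic group to expose bias.
# `records` holds hypothetical (audio, reference, group) triples.
from collections import defaultdict

def error_rate_by_group(records, transcribe, metric):
    scores = defaultdict(list)
    for audio, reference, group in records:
        scores[group].append(metric(reference, transcribe(audio)))
    # Mean error per group; large gaps between groups indicate bias.
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}
```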
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
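The per-machine-ID contrastive idea above can be illustrated with a generic supervised contrastive loss that treats clips sharing a machine ID as positives; this is a sketch of the general technique, not the paper's exact loss or architecture.
```python
# Hedged sketch: supervised contrastive loss keyed by machine ID.
# Embeddings with the same ID attract; all others repel.
import torch
import torch.nn.functional as F

def machine_id_contrastive_loss(embeddings, machine_ids, temperature=0.1):
    z = F.normalize(embeddings, dim=1)                 # (N, D) unit vectors
    sim = z @ z.t() / temperature                      # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = machine_ids.unsqueeze(0) == machine_ids.unsqueeze(1)
    pos = pos & ~eye                                   # positives, minus self
    logits = sim.masked_fill(eye, float('-inf'))       # never contrast with self
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    picked = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    counts = pos.sum(dim=1)
    valid = counts > 0                                 # samples with >=1 positive
    return -(picked.sum(dim=1)[valid] / counts[valid]).mean()
```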
- Training Speech Enhancement Systems with Noisy Speech Datasets [7.157870452667369]
We propose two improvements to train SE systems on noisy speech data.
First, we propose several modifications of the loss functions, which make them robust against noisy speech targets.
We show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way.
arXiv Detail & Related papers (2021-05-26T03:32:39Z)
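As one illustration of making a loss "robust against noisy speech targets", the sketch below uses a trimmed L1 loss that discards the largest residuals, the bins most likely dominated by noise in the target. This is a generic robust-regression idea offered as an assumption; the paper's actual loss modifications may differ.
```python
# Hedged sketch: a trimmed L1 loss that ignores the worst residuals,
# a generic way to tolerate noise in the training targets.
import torch

def trimmed_l1_loss(estimate, noisy_target, keep_fraction=0.9):
    residual = (estimate - noisy_target).abs().flatten()
    k = max(1, int(keep_fraction * residual.numel()))
    kept, _ = torch.topk(residual, k, largest=False)   # k smallest residuals
    return kept.mean()
```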
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
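The joint transcription-and-speaker idea above can be caricatured as one shared encoder feeding two output heads. The sketch below shows only that multi-task skeleton (a full RNN-T adds a prediction network and joiner, omitted here); all dimensions are illustrative assumptions, not SURIT's design.
```python
# Hedged sketch: shared encoder with token and speaker heads.
# A real RNN-T also has a prediction network and joiner (omitted).
import torch
import torch.nn as nn

class JointASRSpeakerSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000, num_speakers=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.token_head = nn.Linear(hidden, vocab_size)      # transcription logits
        self.speaker_head = nn.Linear(hidden, num_speakers)  # speaker-ID logits

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)          # (batch, time, hidden)
        return self.token_head(enc), self.speaker_head(enc)
```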
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims to increase the recognition rate and reduce false alarms in the presence of noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
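The pipeline in the entry above amounts to enhancement in front of the wake-word detector; a minimal hedged sketch follows, where `enhance` and `detect_wuw` are hypothetical components rather than the paper's models.
```python
# Hedged sketch: denoise first, then run wake-up-word detection.
# `enhance` and `detect_wuw` are hypothetical models, not the paper's.
def wake_word_pipeline(noisy_audio, enhance, detect_wuw, threshold=0.5):
    clean_estimate = enhance(noisy_audio)        # speech enhancement front-end
    score = detect_wuw(clean_estimate)           # WUW posterior in [0, 1]
    return score >= threshold                    # trigger only above threshold
```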
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition [0.0]
We present a transfer learning method for speech emotion recognition based on a Time-Delay Neural Network (TDNN) architecture.
We achieve significantly higher accuracy than the state of the art, using five-fold cross-validation.
arXiv Detail & Related papers (2020-08-06T20:37:22Z)
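A common form of the transfer described in the entry above is to freeze a pretrained ASR encoder and train a small emotion head on top; the sketch below assumes a generic `pretrained_encoder` module and is not the paper's TDNN recipe.
```python
# Hedged sketch: reuse a pretrained ASR encoder for emotion recognition.
# `pretrained_encoder` is a hypothetical nn.Module producing (B, T, H) features.
import torch
import torch.nn as nn

class EmotionTransferSketch(nn.Module):
    def __init__(self, pretrained_encoder, hidden=256, num_emotions=4):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False             # freeze the transferred weights
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, feats):
        with torch.no_grad():
            h = self.encoder(feats)             # (B, T, hidden) frame features
        return self.classifier(h.mean(dim=1))   # pool over time, classify
```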
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to learn a perception of touch sounds without access to any ground-truth data.
We show how ideas from classical signal processing can be leveraged to collect large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
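Projecting audio into a t-SNE or UMAP space, as the title above says, can be sketched with off-the-shelf tools; here mean MFCC statistics stand in for whatever features the paper actually learns, so the feature choice is an assumption.
```python
# Hedged sketch: embed audio clips into a 2-D t-SNE space via MFCC stats.
# The feature choice (mean MFCCs) is an assumption, not the paper's method.
import librosa
import numpy as np
from sklearn.manifold import TSNE

def tsne_of_clips(paths, sr=16000, n_mfcc=20):
    feats = []
    for path in paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))          # one vector per clip
    # t-SNE requires perplexity < number of samples
    perplexity = min(30, max(2, len(feats) - 1))
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(np.stack(feats))
```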
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
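Evaluation protocols for separation, like those in the entry above, are typically anchored on signal-level metrics; as a hedged illustration (the paper's exact protocol may differ), here is scale-invariant SNR (SI-SNR), a common choice.
```python
# Hedged sketch: scale-invariant SNR (SI-SNR), a common separation metric.
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    estimate = estimate - estimate.mean()        # remove DC offsets first
    reference = reference - reference.mean()
    # project the estimate onto the reference (scale-invariant target)
    target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```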
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.