Related papers: Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech

Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech

URL: http://arxiv.org/abs/2105.09742v1
Date: Sat, 8 May 2021 16:46:44 GMT
Title: Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech
Authors: Aashish Agarwal and Torsten Zesch
Abstract summary: We argue that many performance numbers reported probably underestimate the expected error rate. We conduct experiments controlling for selection bias, gender as well as overlap (between training and test data) in content, voices, and recording conditions.
Score: 2.715884199292287
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When evaluating the performance of automatic speech recognition models, usually word error rate within a certain dataset is used. Special care must be taken in understanding the dataset in order to report realistic performance numbers. We argue that many performance numbers reported probably underestimate the expected error rate. We conduct experiments controlling for selection bias, gender as well as overlap (between training and test data) in content, voices, and recording conditions. We find that content overlap has the biggest impact, but other factors like gender also play a role.

Related papers

Exploring Gender Disparities in Automatic Speech Recognition Technology [22.729651340592586]
We analyze how performance varies across different gender representations in training data. Our findings suggest a complex interplay between the gender ratio in training data and ASR performance.
arXiv Detail & Related papers (2025-02-25T18:29:38Z)
Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data [13.91630413828167]
This study focuses on identifying the performance disparities of Whisper models on Dutch speech data. We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups.
arXiv Detail & Related papers (2024-11-14T13:29:09Z)
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
Detecting Response Generation Not Requiring Factual Judgment [14.921007421043198]
This study aimed to achieve both attractiveness and factuality in a dialogue response for which a task was set to predict sentences that do not require factual correctness judgment. We created a dataset, dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and classification tasks were performed on several models using this dataset. The model with the highest classification accuracy could yield about 88% accurate classification results.
arXiv Detail & Related papers (2024-06-14T04:03:24Z)
Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data. In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated [70.23064111640132]
We compare the impact of debiasing on performance across multiple downstream tasks using a wide-range of benchmark datasets. Experiments show that the effects of debiasing are consistently emphunderestimated across all tasks.
arXiv Detail & Related papers (2023-09-16T20:25:34Z)
Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation [5.16706940452805]
We perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets. This analysis shows how some datasets are more generalisable than others when used as training data. Experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models.
arXiv Detail & Related papers (2023-07-04T12:22:40Z)
Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets. We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field. We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models [13.956691231452336]
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models. Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget.
arXiv Detail & Related papers (2022-12-03T18:05:08Z)
Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks. Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlates better to human perception. We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.