Assessing ASR Model Quality on Disordered Speech using BERTScore
- URL: http://arxiv.org/abs/2209.10591v1
- Date: Wed, 21 Sep 2022 18:33:33 GMT
- Title: Assessing ASR Model Quality on Disordered Speech using BERTScore
- Authors: Jimmy Tobin, Qisheng Li, Subhashini Venugopalan, Katie Seaver, Richard
Cave, Katrin Tomanek
- Abstract summary: Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality.
It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers.
This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness.
- Score: 5.489867271342724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word Error Rate (WER) is the primary metric used to assess automatic speech
recognition (ASR) model quality. It has been shown that ASR models tend to have
much higher WER on speakers with speech impairments than on typical English
speakers. It is hard to determine whether models can be useful at such high error
rates. This study investigates the use of BERTScore, an evaluation metric for
text generation, to provide a more informative measure of ASR model quality and
usefulness. Both BERTScore and WER were compared against prediction errors
manually annotated by Speech-Language Pathologists for error type and overall
assessment. BERTScore correlated more strongly with the human annotations of
both error type and assessment, and it was notably more robust to orthographic
changes (contraction and normalization errors) where meaning was preserved.
Furthermore, BERTScore fit the human error assessments better than WER did, as
measured by ordinal logistic regression and the Akaike Information Criterion
(AIC). Overall, our findings suggest that BERTScore can complement
WER when assessing ASR model performance from a practical perspective,
especially for accessibility applications where models are useful even at lower
accuracy than for typical speech.
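As a concrete illustration of the comparison described in the abstract, the sketch below scores ASR hypotheses against references with both WER and BERTScore, and shows how either metric could be related to an ordinal human assessment via ordinal logistic regression and AIC. This is a minimal sketch, not the authors' code: the jiwer, bert-score, and statsmodels packages, the toy sentences, and the placeholder ratings are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): compare WER and BERTScore on ASR
# hypotheses, then relate a metric to an ordinal human assessment via
# ordinal logistic regression and AIC. Library choices, toy sentences, and
# ratings are assumptions for illustration only.
import numpy as np
import jiwer                                        # pip install jiwer
from bert_score import score as bert_score          # pip install bert-score
from statsmodels.miscmodels.ordinal_model import OrderedModel

references = [
    "I would like to book an appointment",
    "turn the living room lights off",
]
hypotheses = [
    "I'd like to book an appointment",   # contraction: meaning preserved
    "turn the living room light soft",   # meaning changed
]

# Per-utterance WER penalizes the harmless contraction and the real error alike.
wers = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]

# BERTScore F1 stays close to 1.0 when meaning is preserved despite
# orthographic differences such as contractions.
_, _, f1 = bert_score(hypotheses, references, lang="en", verbose=False)
bert_f1 = f1.tolist()

print("WER:         ", wers)
print("BERTScore F1:", bert_f1)

def ordinal_fit_aic(metric_values, ratings):
    """Fit rating ~ metric with ordinal logistic regression; return AIC.

    `ratings` stands in for ordinal human assessments (e.g. 0 = unusable,
    1 = partly usable, 2 = fully usable), analogous to the Speech-Language
    Pathologist annotations used in the paper.
    """
    exog = np.asarray(metric_values, dtype=float).reshape(-1, 1)
    res = OrderedModel(np.asarray(ratings), exog, distr="logit").fit(
        method="bfgs", disp=False)
    return 2 * len(res.params) - 2 * res.llf   # AIC = 2k - 2*log-likelihood

# With a real annotated evaluation set one would compare, e.g.:
#   ordinal_fit_aic(all_wers, all_ratings) vs. ordinal_fit_aic(all_bert_f1, all_ratings)
# Lower AIC indicates a better fit; the paper reports this favored BERTScore.
```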
Related papers
- Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [69.38024658668887]
Current evaluation methods for event extraction rely on token-level exact match.
We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level.
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, respectively, along with word error rates that are 3.3% and 2.0% lower.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model evaluators' preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our findings show that recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? [9.275790963007173]
We investigated how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy.
We aimed to assess whether imperfect ASR-generated transcripts could provide valuable information.
arXiv Detail & Related papers (2024-01-10T21:38:03Z)
- LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models [58.790604613878216]
We introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models.
The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models (a rough sketch of a punctuation-focused error rate appears after this list).
arXiv Detail & Related papers (2023-10-04T16:23:37Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning [0.20999222360659603]
NoRefER is a novel referenceless quality metric for automatic speech recognition (ASR) systems.
NoRefER exploits the known quality relationships between hypotheses from multiple compression levels of an ASR for learning to rank intra-sample hypotheses by quality.
The results show that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating strong potential for referenceless ASR evaluation and A/B testing.
arXiv Detail & Related papers (2023-06-21T21:26:19Z)
- Toward Human-Like Evaluation for Natural Language Generation with Error Analysis [93.34894810865364]
Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the ultimate goal of evaluation metrics, human-like evaluation, through automatic error analysis.
We augment BARTScore with these human-like error-analysis strategies, yielding BARTScore++, whose final score combines evaluations of both major and minor errors.
arXiv Detail & Related papers (2022-12-20T11:36:22Z)
- TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and Punctuation model evaluation and selection [1.4720080476520687]
Punctuation and segmentation are key to readability in Automatic Speech Recognition.
Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability.
We present TRScore, a novel readability measure using the GPT model to evaluate different segmentation and punctuation systems.
arXiv Detail & Related papers (2022-10-27T01:11:32Z)
- ASR in German: A Detailed Error Analysis [0.0]
This work presents a selection of ASR model architectures that are pretrained on the German language and evaluates them on a benchmark of diverse test datasets.
It identifies cross-architectural prediction errors, classifies those into categories and traces the sources of errors per category back into training data.
arXiv Detail & Related papers (2022-04-12T08:25:01Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
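For the LibriSpeech-PC entry above, the following is a rough sketch of what a punctuation-focused error rate can look like: punctuation marks are extracted from the reference and hypothesis transcripts and scored with an edit-distance-based rate, analogous to WER restricted to punctuation. This is a hedged illustration only; the exact Punctuation Error Rate (PER) defined in the LibriSpeech-PC paper may differ in its details.

```python
# Rough sketch of a punctuation-restricted error rate (in the spirit of the
# PER metric mentioned above); the LibriSpeech-PC paper's exact definition
# may differ. Pure Python, no external dependencies.
PUNCT = {".", ",", "?", "!", ";", ":"}

def punctuation_sequence(text: str) -> list[str]:
    """Return the sequence of punctuation marks appearing in `text`."""
    return [ch for ch in text if ch in PUNCT]

def edit_distance(a: list[str], b: list[str]) -> int:
    """Standard Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, y in enumerate(b, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (x != y))  # substitution / match
        prev = cur
    return prev[-1]

def punctuation_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over punctuation marks, normalized by the reference count."""
    ref_p = punctuation_sequence(reference)
    hyp_p = punctuation_sequence(hypothesis)
    if not ref_p:
        return 0.0 if not hyp_p else float(len(hyp_p))
    return edit_distance(ref_p, hyp_p) / len(ref_p)

# Example: one missing comma and one substituted final mark out of three
# reference punctuation marks -> 2/3.
print(punctuation_error_rate("Yes, I can. Are you sure?",
                             "Yes I can. Are you sure."))
```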
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.