Style-agnostic evaluation of ASR using multiple reference transcripts
- URL: http://arxiv.org/abs/2412.07937v1
- Date: Tue, 10 Dec 2024 21:47:15 GMT
- Title: Style-agnostic evaluation of ASR using multiple reference transcripts
- Authors: Quinten McNamara, Miguel Ángel del Río Fernández, Nishchal Bhandari, Martin Ratajczak, Danny Chen, Corey Miller, Migüel Jetté
- Abstract summary: We attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems.
We find that existing WER reports are likely significantly over-estimating the number of contentful errors made by state-of-the-art ASR systems.
- Score: 0.3066137405373616
- License:
- Abstract: Word error rate (WER) as a metric has a variety of limitations that have plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. As a result, we find that existing WER reports are likely significantly over-estimating the number of contentful errors made by state-of-the-art ASR systems. In addition, we have found our multireference method to be a useful mechanism for comparing the quality of ASR models that differ in the stylistic makeup of their training data and target task.
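As a rough illustration of the multi-reference idea (a minimal sketch in our own words, not the authors' implementation), one simple scheme scores each hypothesis against every available reference, transcribed under different style guidelines, and keeps the most favourable score, so that disagreements driven purely by transcription style are not charged as content errors. All helper names below are ours:

```python
def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            tmp = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,           # word missing from the hypothesis
                       d[j - 1] + 1,       # word inserted by the hypothesis
                       prev_diag + cost)   # substitution (or exact match)
            prev_diag = tmp
    return d[-1]

def multi_reference_wer(hypothesis: str, references: list[str]) -> float:
    """Score the hypothesis against each reference and keep the lowest WER,
    so that purely stylistic choices made by one reference are not counted
    as errors."""
    hyp = hypothesis.lower().split()
    return min(
        word_errors(ref.lower().split(), hyp) / max(len(ref.split()), 1)
        for ref in references
    )

references = [
    "i i think it's gonna be you know around four percent",  # verbatim-style reference
    "I think it's going to be around four percent",          # non-verbatim-style reference
]
hypothesis = "I think it's going to be around four percent"
# Prints 0.0: the hypothesis matches the non-verbatim reference exactly after
# lowercasing, whereas a single verbatim reference would charge it for every
# omitted filler and restart.
print(multi_reference_wer(hypothesis, references))
```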
Related papers
- A Benchmark of French ASR Systems Based on Error Severity [6.657432034629865]
A new evaluation is proposed that categorizes errors into four levels of severity.
This metric is applied to a benchmark of 10 state-of-the-art ASR systems on the French language.
arXiv Detail & Related papers (2025-01-18T21:07:18Z)
- Quantification of stylistic differences in human- and ASR-produced transcripts of African American English [1.8021379035665333]
Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation.
We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English speech.
We investigate the interactions of these categories with how well transcripts can be compared via word error rate.
arXiv Detail & Related papers (2024-09-04T20:18:59Z)
- What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations [0.0]
We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer.
Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, are fundamentally flawed when applied to Indic scripts.
We propose a shift towards developing text normalization routines that leverage native linguistic expertise (see the sketch after this entry).
arXiv Detail & Related papers (2024-09-04T05:08:23Z)
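To make the failure mode concrete, here is a minimal, self-contained example of our own (not code taken from any of the systems named above): an English-oriented normalizer that lowercases, strips punctuation, and drops Unicode combining marks is harmless for ASCII text, but it silently deletes Devanagari vowel signs and viramas, so the "normalized" reference no longer spells the original words.

```python
import re
import unicodedata

def naive_normalize(text: str) -> str:
    """English-style normalization: lowercase, strip combining marks
    (Unicode category 'Mn'), then drop punctuation."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return re.sub(r"[^\w\s]", "", stripped)

print(naive_normalize("Don't worry!"))  # -> "dont worry": harmless for English
# Devanagari vowel signs and viramas are combining marks (category 'Mn'),
# so they are deleted and the words themselves change:
print(naive_normalize("नमस्ते दुनिया"))
```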
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods (see the sketch after this entry).
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
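A minimal sketch of the confidence-filtering idea above (our own illustration; `correct_with_llm` is a hypothetical hook, not an interface from the paper): segments the recognizer is already confident about are passed through unchanged, and only low-confidence segments are handed to the LLM, so the corrector cannot damage transcripts that were likely correct to begin with.

```python
from typing import Callable

def filtered_correction(
    segments: list[tuple[str, float]],       # (ASR text, confidence in [0, 1])
    correct_with_llm: Callable[[str], str],  # hypothetical LLM correction hook
    threshold: float = 0.9,                  # assumed confidence cut-off
) -> list[str]:
    """Send only low-confidence segments to the LLM; keep the rest verbatim."""
    corrected = []
    for text, confidence in segments:
        if confidence >= threshold:
            corrected.append(text)                    # likely accurate: leave untouched
        else:
            corrected.append(correct_with_llm(text))  # low confidence: let the LLM revise
    return corrected
```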
- RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z)
- Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? [9.275790963007173]
We investigated how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy.
We aimed to assess whether imperfect ASR-generated transcripts could provide valuable information.
arXiv Detail & Related papers (2024-01-10T21:38:03Z)
- Explanations for Automatic Speech Recognition [9.810810252231812]
We provide an explanation for an ASR transcription as a subset of audio frames.
We adapt existing explainable AI techniques from image classification: Statistical Fault Localisation (SFL) and causal analysis.
We evaluate the quality of the explanations generated by the proposed techniques over three different ASR systems (Google API, the baseline model of Sphinx, and Deepspeech) and 100 audio samples from the Commonvoice dataset.
arXiv Detail & Related papers (2023-02-27T11:09:19Z)
- End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where transcription accuracy and reading order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z)
- Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR).
Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously.
Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.