Unsupervised Language agnostic WER Standardization
- URL: http://arxiv.org/abs/2303.05046v1
- Date: Thu, 9 Mar 2023 05:50:54 GMT
- Title: Unsupervised Language agnostic WER Standardization
- Authors: Satarupa Guha, Rahul Ambavat, Ankur Gupta, Manish Gupta, Rupeshkumar
Mehta
- Abstract summary: We propose an automatic WER normalization system consisting of two modules: spelling normalization and segmentation normalization.
Experiments with ASR on 35K utterances across four languages yielded an average WER reduction of 13.28%.
- Score: 4.768240090076601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word error rate (WER) is a standard metric for the evaluation of Automated
Speech Recognition (ASR) systems. However, WER fails to provide a fair
evaluation of human-perceived quality in the presence of spelling variations,
abbreviations, or compound words arising out of agglutination. Multiple
spelling variations might be acceptable based on locale/geography, alternative
abbreviations, borrowed words, and transliteration of code-mixed words from a
foreign language to the target language script. Similarly, in the case of
agglutination, often both the agglutinated and the split forms are
acceptable. Previous work handled this problem by using manually identified
normalization pairs and applying them to both the transcription and the
hypothesis before computing WER. In this paper, we propose an automatic WER
normalization system consisting of two modules: spelling normalization and
segmentation normalization. The proposed system is unsupervised and language
agnostic, and therefore scalable. Experiments with ASR on 35K utterances across
four languages yielded an average WER reduction of 13.28%. Human judgements of
these automatically identified normalization pairs show that our WER-normalized
evaluation is highly consistent with the perceived quality of ASR output.
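The following is a minimal sketch of the evaluation-time idea described in the abstract: normalization pairs are applied to both the reference and the hypothesis before WER is computed. The spelling and split pairs below are hypothetical placeholders; the paper's contribution is discovering such pairs automatically and language-agnostically, which is not reproduced here.

```python
# Sketch: WER with spelling and segmentation normalization applied to both
# reference and hypothesis before scoring. The normalization pairs are
# illustrative placeholders, not the paper's automatically discovered ones.

def edit_distance(ref_tokens, hyp_tokens):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def normalize(tokens, spelling_map, split_map):
    """Map spelling variants to a canonical form and split agglutinated words."""
    out = []
    for tok in tokens:
        tok = spelling_map.get(tok, tok)
        out.extend(split_map.get(tok, [tok]))
    return out

def normalized_wer(reference, hypothesis, spelling_map, split_map):
    ref = normalize(reference.lower().split(), spelling_map, split_map)
    hyp = normalize(hypothesis.lower().split(), spelling_map, split_map)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Hypothetical normalization pairs for illustration only.
SPELLING = {"colour": "color", "cancelled": "canceled"}
SPLITS = {"healthcare": ["health", "care"]}

print(normalized_wer("the colour of healthcare",
                     "the color of health care",
                     SPELLING, SPLITS))  # 0.0 after normalization
```

In the toy example, the reference and hypothesis differ only by a spelling variant and an agglutinated compound, so the normalized WER drops to zero, matching the perceived quality of the output.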
Related papers
- What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations [0.0]
We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer.
Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, are fundamentally flawed when applied to Indic scripts.
We propose a shift towards developing text normalization routines that leverage native linguistic expertise.
arXiv Detail & Related papers (2024-09-04T05:08:23Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency [8.888638284299736]
We create a lattice of plausible respellings of the reference transcription using a combination of lexical resources, a Japanese text-processing system, and a neural machine translation model.
Our method, which does not penalize the system for choosing a valid alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER depending on the task (see the lenient-CER sketch after this list).
arXiv Detail & Related papers (2023-06-07T15:39:02Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where the transcription accuracy and the reading order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z)
- Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.
arXiv Detail & Related papers (2022-11-22T08:14:07Z)
- Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying further progression.
This paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, where the expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely tag refinement strategy and tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
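As referenced above, here is a minimal sketch of the lenient-scoring idea from the Japanese spelling-inconsistency paper: the hypothesis is scored against every acceptable respelling of the reference and the lowest character error rate is kept. The variant list is a hypothetical stand-in for the respelling lattice that the paper builds from lexical resources, a Japanese text-processing system, and a neural machine translation model, none of which is reproduced here.

```python
# Sketch: lenient CER, scoring the hypothesis against a set of acceptable
# respellings of the reference and keeping the best (lowest) score.
# The variant list is a hypothetical placeholder for the paper's lattice.

def char_edit_distance(ref, hyp):
    """Character-level Levenshtein distance with a rolling 1-D table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev = dp[0]
        dp[0] = i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(cur + 1,          # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[n]

def lenient_cer(reference_variants, hypothesis):
    """CER against the closest acceptable spelling of the reference."""
    return min(char_edit_distance(ref, hypothesis) / max(len(ref), 1)
               for ref in reference_variants)

# Hypothetical acceptable spellings of one reference utterance.
variants = ["smartphone o kaimashita", "sumaho o kaimashita"]
print(lenient_cer(variants, "sumaho o kaimashita"))  # 0.0 with the lenient variant
```

Enumerating variants is only an approximation of scoring against a lattice, but it captures the key property: the system is not penalized for choosing a valid alternate spelling of a word.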