Advocating Character Error Rate for Multilingual ASR Evaluation
- URL: http://arxiv.org/abs/2410.07400v2
- Date: Fri, 18 Oct 2024 15:54:56 GMT
- Title: Advocating Character Error Rate for Multilingual ASR Evaluation
- Authors: Thennal D K, Jesin James, Deepa P Gopinath, Muhammed Ashraf K
- Abstract summary: We document the limitations of the word error rate (WER) as an evaluation metric and advocate for the character error rate (CER) as the primary metric.
We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems.
Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.
- Score: 1.2597747768235845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER's simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages: Malayalam, English, and Arabic, which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.
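To make the contrast between the two metrics concrete, here is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, taken over words for WER and over characters for CER. This is not the authors' evaluation code. The function names are illustrative, and the sketch assumes whitespace tokenization for WER, counts spaces as characters for CER, and treats each Unicode code point as one character (which differs from grapheme clusters in scripts such as Malayalam). It also applies no text normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance; substitutions, insertions, deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (assumption: words are whitespace-separated)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumption: code points, spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# A single missing word boundary: every reference word counts as wrong
# under WER, while CER registers only the one deleted space.
print(f"WER: {wer('ice cream', 'icecream'):.3f}")  # WER: 1.000
print(f"CER: {cer('ice cream', 'icecream'):.3f}")  # CER: 0.111
```

The example shows the kind of word-boundary sensitivity the abstract describes: one small transcription difference yields a WER of 100% but a CER of about 11%.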
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
arXiv Detail & Related papers (2024-09-27T03:31:32Z) - What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations [0.0]
We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer.
Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, are fundamentally flawed when applied to Indic scripts.
We propose a shift towards developing text normalization routines that leverage native linguistic expertise.
arXiv Detail & Related papers (2024-09-04T05:08:23Z) - The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese [5.308321515594125]
This study is dedicated to a comprehensive exploration of the Whisper and MMS systems.
Our investigation encompasses various categories, including gender, age, skin tone color, and geo-location.
We empirically show that oversampling techniques alleviate such stereotypical biases.
arXiv Detail & Related papers (2024-02-12T09:35:13Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - CL-MASR: A Continual Learning Benchmark for Multilingual ASR [15.974765568276615]
We propose CL-MASR, a benchmark for studying multilingual automatic speech recognition in a continual learning setting.
CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics.
To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.
arXiv Detail & Related papers (2023-10-25T18:55:40Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Bilingual End-to-End ASR with Byte-Level Subwords [4.268218327369146]
We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE).
We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances.
We find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative, even with a smaller number of outputs and fewer parameters.
arXiv Detail & Related papers (2022-05-01T15:01:01Z) - Language Dependencies in Adversarial Attacks on Speech Recognition Systems [0.0]
We compare the attackability of a German and an English ASR system.
We investigate if one of the language models is more susceptible to manipulations than the other.
arXiv Detail & Related papers (2022-02-01T13:27:40Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved.
However, the peak performance is not met using the general-purpose multilingual text encoders "off-the-shelf", but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)