H_eval: A new hybrid evaluation metric for automatic speech recognition tasks
- URL: http://arxiv.org/abs/2211.01722v3
- Date: Fri, 1 Dec 2023 12:54:42 GMT
- Title: H_eval: A new hybrid evaluation metric for automatic speech recognition tasks
- Authors: Zitha Sasindran, Harsha Yelchuri, T. V. Prabhakar, Supreeth Rao
- Abstract summary: We propose H_eval, a new hybrid evaluation metric for ASR systems.
It considers both semantic correctness and error rate, and it performs well in scenarios where WER and SD perform poorly.
- Score: 0.3277163122167433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many studies have examined the shortcomings of word error rate (WER) as an evaluation metric for automatic speech recognition (ASR) systems. Since WER considers only literal word-level correctness, new evaluation metrics based on semantic similarity, such as semantic distance (SD) and BERTScore, have been developed. However, we found that these metrics have their own limitations, such as a tendency to overly prioritise keywords. We propose H_eval, a new hybrid evaluation metric for ASR systems that considers both semantic correctness and error rate and performs well in scenarios where WER and SD perform poorly. Because it is computationally lighter than BERTScore, it reduces metric computation time by a factor of 49. Furthermore, we show that H_eval correlates strongly with downstream NLP task performance. In addition, to further reduce metric calculation time, we built multiple fast and lightweight models using distillation techniques.
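The abstract does not spell out how the two signals are combined, but the general idea of a hybrid ASR metric can be sketched. The snippet below is a minimal illustration, assuming jiwer for WER and sentence-transformers for semantic similarity; the weight `alpha` and the blending rule are illustrative assumptions, not the paper's definition of H_eval.

```python
# Illustrative sketch of a hybrid ASR metric that blends word-level accuracy
# with semantic similarity. The blending rule and weight are assumptions for
# demonstration; they are not the H_eval formula from the paper.
import jiwer
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def hybrid_score(reference: str, hypothesis: str, alpha: float = 0.5) -> float:
    """Return a score in [0, 1]-ish range; higher means a better hypothesis."""
    # Literal correctness: clip WER at 1.0 so insertion-heavy outputs
    # do not push the accuracy term below zero.
    wer = min(jiwer.wer(reference, hypothesis), 1.0)
    word_accuracy = 1.0 - wer

    # Semantic correctness: cosine similarity of sentence embeddings.
    emb = _model.encode([reference, hypothesis], convert_to_tensor=True)
    semantic_sim = util.cos_sim(emb[0], emb[1]).item()

    # Weighted blend of the two views (alpha is an illustrative choice).
    return alpha * word_accuracy + (1.0 - alpha) * semantic_sim

print(hybrid_score("turn on the kitchen lights",
                   "turn on the kitchen light"))
```

A hypothesis with a minor surface error but the same meaning scores high under such a blend, whereas a keyword-preserving but garbled hypothesis is penalised by the WER term; this is the kind of trade-off the hybrid metric targets.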
Related papers
- Linear-time Minimum Bayes Risk Decoding with Reference Aggregation [52.1701152610258]
Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations.
It requires the pairwise calculation of a utility metric, which has quadratic complexity.
We propose to approximate pairwise metric scores with scores calculated against aggregated reference representations (see the illustrative sketch after this list).
arXiv Detail & Related papers (2024-02-06T18:59:30Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Timestamped Embedding-Matching Acoustic-to-Word CTC ASR [2.842794675894731]
We describe a novel method of training an embedding-matching word-level connectionist temporal classification (CTC) automatic speech recognizer (ASR).
The word timestamps enable the ASR to output word segmentations and word confusion networks without relying on a secondary model or forced alignment process when testing.
arXiv Detail & Related papers (2023-06-20T11:53:43Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Toward Zero Oracle Word Error Rate on the Switchboard Benchmark [0.3297645391680979]
The "Switchboard benchmark" is a very well-known test set in automatic speech recognition (ASR) research.
This work highlights lesser-known practical considerations of this evaluation, demonstrating major improvements in word error rate (WER).
Even commercial ASR systems can score below 5% WER and the established record for a research system is lowered to 2.3%.
arXiv Detail & Related papers (2022-06-13T14:26:40Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding [26.958001571944678]
We propose a novel Semantic Distance (SemDist) measure as an alternative evaluation metric for ASR systems.
We demonstrate the effectiveness of our proposed metric on various downstream tasks, including intent recognition, semantic parsing, and named entity recognition.
arXiv Detail & Related papers (2021-04-05T20:25:07Z)
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
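The reference-aggregation idea summarised for the linear-time MBR paper above can be sketched without that paper's specific utility metric. The toy example below is a hypothetical illustration: it uses a mean-pooled sentence embedding as the aggregated reference representation, cosine similarity as a stand-in utility, and lets the candidate list double as pseudo-references, so each candidate is scored once against the aggregate rather than pairwise against every other candidate.

```python
# Toy illustration of reference aggregation for MBR-style selection:
# instead of scoring every candidate against every reference (quadratic),
# each candidate is scored once against an aggregated reference embedding
# (linear in the number of candidates). The encoder and cosine utility are
# illustrative stand-ins, not the paper's choices.
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def mbr_with_aggregation(candidates: list[str]) -> str:
    """Pick the candidate closest to the mean embedding of all candidates,
    which serve as pseudo-references as in sampling-based MBR."""
    emb = model.encode(candidates, convert_to_tensor=True)   # shape (n, d)
    aggregate = emb.mean(dim=0, keepdim=True)                # shape (1, d)
    scores = util.cos_sim(emb, aggregate).squeeze(1)         # shape (n,)
    return candidates[int(torch.argmax(scores))]

print(mbr_with_aggregation([
    "the cat sat on the mat",
    "a cat sat on a mat",
    "the bat sat on the hat",
]))
```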