Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
- URL: http://arxiv.org/abs/2502.09416v1
- Date: Thu, 13 Feb 2025 15:39:07 GMT
- Title: Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
- Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
- Abstract summary: We propose an aggregation method for automatic evaluation metrics that aligns with human evaluation methods, bridging the gap.
We conducted experiments using various metrics, including edit-based metrics, $n$-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark.
- Score: 13.02513034520894
- Abstract: One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that the ranking matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, $n$-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform GPT-4-based metrics. We publish our unified implementation of the metrics and meta-evaluations.
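To make the contrast concrete, here is a minimal sketch of the human-style aggregation applied to an automatic metric: per-sentence scores are reduced to pairwise wins, which a rating model then turns into a system ranking. The abstract does not name the rating algorithm, so a Bradley-Terry fit stands in for it here (human GEC meta-evaluations often use TrueSkill); `rank_systems` and the toy data are illustrative, not the paper's implementation.

```python
import itertools
from collections import defaultdict

def rank_systems(scores, iters=200):
    """Rank systems from sentence-level metric scores, human-style.

    scores: dict mapping system name -> list of per-sentence scores,
    with all lists aligned on the same source sentences. Rather than
    averaging each list into a corpus-level score, every sentence is
    treated as a round of pairwise comparisons, and a Bradley-Terry
    model is fit to the resulting wins.
    """
    systems = list(scores)
    wins = defaultdict(float)   # total wins per system
    pairs = defaultdict(float)  # decisive comparisons per unordered pair
    num_sents = len(next(iter(scores.values())))

    for i in range(num_sents):
        for a, b in itertools.combinations(systems, 2):
            if scores[a][i] == scores[b][i]:
                continue  # ties carry no preference information
            pairs[frozenset((a, b))] += 1
            wins[a if scores[a][i] > scores[b][i] else b] += 1

    # Minorize-maximize updates for Bradley-Terry strengths.
    p = {s: 1.0 for s in systems}
    for _ in range(iters):
        for s in systems:
            denom = sum(pairs[frozenset((s, t))] / (p[s] + p[t])
                        for t in systems if t != s)
            if denom > 0:
                p[s] = max(wins[s], 1e-9) / denom
        total = sum(p.values())
        p = {s: v / total for s, v in p.items()}

    return sorted(systems, key=lambda s: p[s], reverse=True)

# Toy usage with three hypothetical systems scored on four sentences.
scores = {
    "sys_A": [0.9, 0.7, 0.6, 0.6],
    "sys_B": [0.8, 0.9, 0.7, 0.7],
    "sys_C": [0.5, 0.6, 0.6, 0.5],
}
print(rank_systems(scores))  # ['sys_B', 'sys_A', 'sys_C']
```

Note how the ranking can differ from score averaging: a system that wins many sentences by small margins is rewarded, whereas averaging lets a few large score gaps dominate.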
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings at two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z)
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z)
- JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models [1.534667887016089]
We propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs.
We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations.
arXiv Detail & Related papers (2023-11-07T18:33:34Z)
- Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data [4.100928307172084]
When multiple rating lists are combined or considered together, subjects often have missing ratings.
We analyze missing-value patterns using six real-world data sets from various applications.
We propose optimization models and algorithms that minimize the total rating discordance across rating providers.
arXiv Detail & Related papers (2023-11-07T14:42:06Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics; a sketch of the WEAT effect size follows this list.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality; a minimal sketch of this definition follows this list.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
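For the social-biases paper above, here is a minimal sketch of the standard WEAT effect size (Caliskan et al., 2017) that the method builds on, assuming word embeddings are already available as NumPy vectors; the paper's extension of WEAT/SEAT to evaluation metrics themselves is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size over two target sets (X, Y) and two attribute
    sets (A, B), each a list of embedding vectors. Returns a Cohen's-d
    style measure of differential association."""
    def s(w):
        # How much more w associates with attributes A than with B.
        return (np.mean([cosine(w, a) for a in A])
                - np.mean([cosine(w, b) for b in B]))

    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```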
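And for the system-level correlation paper, a sketch of the conventional definition it re-examines: per-input metric scores are averaged into one score per system, and those system scores are correlated with averaged human judgments (argument names are illustrative).

```python
import numpy as np
from scipy import stats

def system_level_correlation(metric_scores, human_scores):
    """metric_scores, human_scores: dicts mapping system name -> list
    of per-summary scores. Each system is collapsed to its mean score,
    and correlations are computed across systems."""
    systems = sorted(metric_scores)
    m = [np.mean(metric_scores[s]) for s in systems]
    h = [np.mean(human_scores[s]) for s in systems]
    return stats.pearsonr(m, h)[0], stats.spearmanr(m, h)[0]
```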