Revisiting Meta-evaluation for Grammatical Error Correction
- URL: http://arxiv.org/abs/2403.02674v2
- Date: Sun, 26 May 2024 12:05:35 GMT
- Title: Revisiting Meta-evaluation for Grammatical Error Correction
- Authors: Masamune Kobayashi, Masato Mita, Mamoru Komachi
- Abstract summary: SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
- Score: 14.822205658480813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Metrics are the foundation of automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC face several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup based on classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities, edit-based and sentence-based, covering 12 state-of-the-art systems, including large language models (LLMs), and two human corrections with different focuses. The improved correlations obtained by aligning the granularity in the sentence-level meta-evaluation suggest that edit-based metrics may have been underestimated in existing studies. Furthermore, the correlations of most metrics decrease when moving from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.
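To make the meta-evaluation setup concrete, below is a minimal, self-contained sketch of system-level meta-evaluation: each system is scored with an edit-based metric (an F0.5 over hypothesis edits against reference edits, in the spirit of M2/ERRANT-style scoring), and the metric's scores are then correlated with human ratings. All names and data here are illustrative placeholders, not taken from SEEDA or the paper's experiments.

```python
# Minimal sketch of system-level GEC meta-evaluation (illustrative only).
from scipy.stats import pearsonr, spearmanr, kendalltau


def f_beta(hyp_edits, ref_edits, beta=0.5):
    """Edit-based score: F0.5 of hypothesis edits against reference edits."""
    tp = len(hyp_edits & ref_edits)
    precision = tp / len(hyp_edits) if hyp_edits else 1.0
    recall = tp / len(ref_edits) if ref_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical per-system (hypothesis edits, reference edits) pairs and human
# ratings; in practice these come from annotated corrections and judgments.
systems = {
    "sys_A": ({(0, 1, "The"), (3, 4, "went")}, {(0, 1, "The"), (3, 4, "went")}),
    "sys_B": ({(0, 1, "A"), (3, 4, "went")}, {(0, 1, "The"), (3, 4, "went")}),
    "sys_C": (set(), {(0, 1, "The"), (3, 4, "went")}),
}
human_ratings = {"sys_A": 0.9, "sys_B": 0.6, "sys_C": 0.2}  # illustrative

metric_scores = [f_beta(hyp, ref) for hyp, ref in systems.values()]
human_scores = [human_ratings[name] for name in systems]

# System-level meta-evaluation: correlation between metric and human scores.
print("Pearson :", pearsonr(metric_scores, human_scores)[0])
print("Spearman:", spearmanr(metric_scores, human_scores)[0])
print("Kendall :", kendalltau(metric_scores, human_scores)[0])
```

In a SEEDA-like setting, the same procedure would be repeated against edit-based and sentence-based human ratings to see how each metric's correlation changes with the evaluation granularity.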
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction [28.533044857379647]
The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics.
We propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems.
arXiv Detail & Related papers (2024-07-01T03:35:58Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction [60.32771192285546]
ChatGPT has demonstrated impressive performance in various downstream tasks.
In the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics.
This paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints.
arXiv Detail & Related papers (2023-11-14T14:56:33Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)
- Revisiting Grammatical Error Correction Evaluation and Beyond [38.12193886109598]
This paper takes the first step towards understanding and improving GEC evaluation with pretraining.
We propose PT-M2, a novel GEC evaluation metric that achieves the best of both worlds by using PT-based metrics only to score the corrected parts.
arXiv Detail & Related papers (2022-11-03T07:55:12Z)
- A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction [4.60495447017298]
The evaluation values of the same error correction model can vary considerably under different word segmentation systems or different language models.
We propose three novel evaluation metrics for CGEC in two dimensions: reference-based and reference-less.
arXiv Detail & Related papers (2022-04-30T09:40:04Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)