Evaluation of really good grammatical error correction
- URL: http://arxiv.org/abs/2308.08982v1
- Date: Thu, 17 Aug 2023 13:45:35 GMT
- Title: Evaluation of really good grammatical error correction
- Authors: Robert Östling, Katarina Gillholm, Murathan Kurfalı, Marie Mattson, Mats Wirén
- Abstract summary: Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Although rarely stated, in practice, Grammatical Error Correction (GEC)
encompasses various models with distinct objectives, ranging from grammatical
error detection to improving fluency. Traditional evaluation methods fail to
capture the full range of system capabilities and objectives.
Reference-based evaluations are limited in capturing the wide variety of
possible corrections, are subject to the biases introduced during reference
creation, and are prone to favor fixing local errors over overall text
improvement. The emergence of large language models (LLMs) has further
highlighted the shortcomings of these evaluation strategies, emphasizing the
need for a paradigm shift in evaluation methodology. In the current study, we
perform a comprehensive evaluation of various GEC systems using a recently
published dataset of Swedish learner texts. The evaluation is performed using
established evaluation metrics as well as human judges. We find that GPT-3 in a
few-shot setting by far outperforms previous grammatical error correction
systems for Swedish, a language comprising only 0.11% of its training data. We
also find that current evaluation methods contain undesirable biases that a
human evaluation is able to reveal. We suggest using human post-editing of GEC
system outputs to analyze the amount of change required to reach native-level
human performance on the task, and provide a dataset annotated with human
post-edits and assessments of grammaticality, fluency and meaning preservation
of GEC system outputs.
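The abstract proposes human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance. A minimal sketch of one way to quantify that change, as a normalized character-level edit distance between a system output and its human post-edit; the example sentences are hypothetical and this proxy is not necessarily the paper's exact protocol:

```python
# Minimal sketch: quantify how much post-editing a GEC system's output
# needed, as a normalized character-level edit distance. This is an
# illustrative proxy, not the paper's exact measurement protocol.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def post_edit_ratio(system_output: str, human_post_edit: str) -> float:
    """Fraction of the (longer) string that had to change; 0.0 = no edits."""
    if not system_output and not human_post_edit:
        return 0.0
    return levenshtein(system_output, human_post_edit) / max(
        len(system_output), len(human_post_edit)
    )


# Hypothetical example: a system correction and a native-speaker post-edit.
system_output = "He go to school every days."
human_post_edit = "He goes to school every day."
print(f"post-edit ratio: {post_edit_ratio(system_output, human_post_edit):.3f}")
```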
Related papers
- CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction [28.533044857379647]
The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics.
We propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems.
arXiv Detail & Related papers (2024-07-01T03:35:58Z) - Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction [14.822205658480813]
Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks.
This study investigates the performance of LLMs in grammatical error correction (GEC) evaluation by employing prompts inspired by previous research.
arXiv Detail & Related papers (2024-03-26T09:43:15Z) - Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
- Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings at two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z) - Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z) - MISMATCH: Fine-grained Evaluation of Machine-generated Text with
Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar
Error Correction [4.60495447017298]
The evaluation values of the same error correction model can vary considerably under different word segmentation systems or different language models.
We propose three novel evaluation metrics for CGEC in two dimensions: reference-based and reference-less.
arXiv Detail & Related papers (2022-04-30T09:40:04Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.