Revisiting Grammatical Error Correction Evaluation and Beyond
- URL: http://arxiv.org/abs/2211.01635v1
- Date: Thu, 3 Nov 2022 07:55:12 GMT
- Title: Revisiting Grammatical Error Correction Evaluation and Beyond
- Authors: Peiyuan Gong, Xuebo Liu, Heyan Huang, Min Zhang
- Abstract summary: This paper takes the first step towards understanding and improving GEC evaluation with pretraining.
We propose a novel GEC evaluation metric that achieves the best of both worlds, namely PT-M2, which uses PT-based metrics to score only the corrected parts.
- Score: 38.12193886109598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore
and BARTScore) have been widely used in several sentence generation tasks
(e.g., machine translation and text summarization) due to their better
correlation with human judgments over traditional overlap-based methods.
Although PT-based methods have become the de facto standard for training
grammatical error correction (GEC) systems, GEC evaluation still does not
benefit from pretrained knowledge. This paper takes the first step towards
understanding and improving GEC evaluation with pretraining. We first find that
arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory
correlation results because of excessive attention to inessential system
outputs (e.g., unchanged parts). To alleviate this limitation, we propose a
novel GEC evaluation metric that achieves the best of both worlds, namely PT-M2,
which uses PT-based metrics to score only the corrected parts. Experimental
results on the CoNLL14 evaluation task show that PT-M2 significantly
outperforms existing methods, achieving a new state-of-the-art result of 0.949
Pearson correlation. Further analysis reveals that PT-M2 is robust when evaluating
competitive GEC systems. Source code and scripts are freely available at
https://github.com/pygongnlp/PT-M2.
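To make the idea concrete, here is a minimal sketch of the scoring scheme the abstract describes: apply a PT-based metric (BERTScore here, via the bert_score package) only to the corrected spans, then combine precision and recall in an M2-style F0.5. This is not the released implementation from the repository above; the naive token-diff edit extraction, the greedy edit matching, and the helper names (`edit_spans`, `pt_weighted_f05`) are simplifying assumptions for illustration, whereas real GEC evaluation aligns edits with tools such as ERRANT or the M2 scorer.

```python
# Illustrative sketch only: score corrected parts with a PT-based metric and
# fold the result into an M2-style F0.5. Not the official PT-M2 code.
from bert_score import score as bert_score


def edit_spans(source: str, hypothesis: str) -> list[tuple[str, str]]:
    """Toy edit extractor: pair up differing tokens between source and hypothesis.
    (A real evaluator would use ERRANT/M2 alignments instead of a token zip.)"""
    return [(s, h) for s, h in zip(source.split(), hypothesis.split()) if s != h]


def pt_weighted_f05(source: str, hypothesis: str, reference: str, beta: float = 0.5) -> float:
    """Score only the corrected spans with BERTScore and combine precision and
    recall with the precision-weighted F0.5 used by the M2 scorer."""
    hyp_edits = edit_spans(source, hypothesis)
    ref_edits = edit_spans(source, reference)
    if not hyp_edits or not ref_edits:
        return 0.0

    # Pairwise PT-based similarity between hypothesis edits and reference edits.
    cands = [h for _, h in hyp_edits for _ in ref_edits]
    refs = [r for _, r in ref_edits] * len(hyp_edits)
    _, _, f1 = bert_score(cands, refs, lang="en", verbose=False)
    sim = f1.view(len(hyp_edits), len(ref_edits))

    precision = sim.max(dim=1).values.mean().item()  # best reference match per hypothesis edit
    recall = sim.max(dim=0).values.mean().item()     # best hypothesis match per reference edit
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


if __name__ == "__main__":
    print(pt_weighted_f05("He go to school .", "He goes to school .", "He goes to school ."))
```

The key point the sketch tries to capture is that the pretrained metric judges only the edited spans, so unchanged parts of the output cannot dominate the score; the paper's actual formulation weights edits inside the standard multi-reference M2 computation rather than the single-reference shortcut used here.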
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Aligning GPTRec with Beyond-Accuracy Goals with Reinforcement Learning [67.71952251641545]
GPTRec is an alternative to the Top-K model for item-by-item recommendations.
Our experiments on two datasets show that GPTRec's Next-K generation approach offers a better tradeoff between accuracy and secondary metrics than classic greedy re-ranking techniques.
arXiv Detail & Related papers (2024-03-07T19:47:48Z)
- Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings at two levels of granularity.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z)
- MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker also sets the state of the art against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- Grammatical Error Correction via Mixed-Grained Weighted Training [68.94921674855621]
Grammatical Error Correction (GEC) aims to automatically correct grammatical errors in natural texts.
The proposed method, MainGEC, assigns token-level and sentence-level training weights based on inherent discrepancies in the accuracy and potential diversity of data annotations.
arXiv Detail & Related papers (2023-11-23T08:34:37Z)
- Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction [60.32771192285546]
ChatGPT has demonstrated impressive performance in various downstream tasks.
In the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics.
This paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints.
arXiv Detail & Related papers (2023-11-14T14:56:33Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction? [10.302225525539003]
We explore several ensemble strategies based on strong PLMs with four sophisticated single models.
Performance does not improve and even degrades after PLM-based ensembling.
arXiv Detail & Related papers (2023-05-24T14:18:52Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)