CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction
- URL: http://arxiv.org/abs/2305.10819v2
- Date: Tue, 17 Oct 2023 04:56:57 GMT
- Title: CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction
- Authors: Jingheng Ye, Yinghui Li, Qingyu Zhou, Yangning Li, Shirong Ma, Hai-Tao Zheng, Ying Shen
- Abstract summary: Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
- Score: 32.44051877804761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the performance of Grammatical Error Correction (GEC) systems is a challenging task due to its subjectivity. Designing an evaluation metric that is as objective as possible is crucial to the development of the GEC task. However, mainstream evaluation metrics, i.e., reference-based metrics, introduce bias into multi-reference evaluation by extracting edits without considering the presence of multiple references. To overcome this issue, we propose Chunk-LEvel Multi-reference Evaluation (CLEME), designed to evaluate GEC systems in the multi-reference setting. CLEME builds chunk sequences with consistent boundaries for the source, the hypothesis, and the references, thus eliminating the bias caused by inconsistent edit boundaries. Furthermore, we observe that these consistent boundaries can also act as the boundaries of grammatical errors, and on that basis we compute the F$_{0.5}$ score under a correction independence assumption. We conduct experiments on six English reference sets based on the CoNLL-2014 shared task. Extensive experiments and detailed analyses demonstrate the correctness of our discovery and the effectiveness of CLEME. Further analysis reveals that CLEME is robust when evaluating GEC systems across reference sets with varying numbers of references and annotation styles.
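To make the chunk-level scoring idea concrete, below is a minimal Python sketch, not the official CLEME implementation. It assumes chunk sequences with consistent boundaries have already been built for the source, the hypothesis, and all references, and every name in it (f_beta, chunk_level_f05, the chunk-id-to-correction dictionaries) is an illustrative assumption rather than CLEME's actual API.

```python
# Minimal sketch of chunk-level multi-reference F_0.5 scoring (not the official
# CLEME implementation). It assumes chunk sequences with consistent boundaries
# have already been built, so a hypothesis/reference edit can be represented as
# a mapping from a chunk id to the corrected text for that chunk.

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def chunk_level_f05(hyp_chunks: dict[int, str],
                    ref_chunk_sets: list[dict[int, str]]) -> float:
    """Judge each changed chunk independently against the pooled references.

    hyp_chunks:     {chunk_id: corrected_text} for chunks the hypothesis changed.
    ref_chunk_sets: one {chunk_id: corrected_text} dict per reference.
    """
    # Pool, per chunk, every correction that at least one reference accepts.
    gold: dict[int, set[str]] = {}
    for ref in ref_chunk_sets:
        for cid, text in ref.items():
            gold.setdefault(cid, set()).add(text)

    # A hypothesis change is a true positive if any reference proposes the same
    # correction for that chunk; otherwise it is a false positive.
    tp = sum(1 for cid, text in hyp_chunks.items() if text in gold.get(cid, set()))
    fp = len(hyp_chunks) - tp
    # Chunks corrected by some reference but left unchanged by the hypothesis.
    fn = sum(1 for cid in gold if cid not in hyp_chunks)
    return f_beta(tp, fp, fn)


if __name__ == "__main__":
    # Toy example: the hypothesis fixes chunk 3 correctly and chunk 7 wrongly.
    hyp = {3: "went", 7: "the"}
    refs = [{3: "went"}, {3: "went", 7: "a"}]
    print(f"chunk-level F0.5 = {chunk_level_f05(hyp, refs):.3f}")  # 0.556
```

Pooling acceptable corrections per chunk and scoring each chunk on its own is meant to mirror the correction independence assumption described above; the actual CLEME metric additionally specifies how the consistent chunk boundaries are constructed, which this sketch takes as given.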
Related papers
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves SOTA performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction [28.533044857379647]
The paper focuses on improving the interpretability of Grammatical Error Correction (GEC) metrics.
We propose CLEME2.0, a reference-based evaluation strategy that can describe four elementary dimensions of GEC systems.
arXiv Detail & Related papers (2024-07-01T03:35:58Z)
- Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether reference-free metrics have any deficiencies.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results show that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction [60.32771192285546]
ChatGPT has demonstrated impressive performance in various downstream tasks.
In the Chinese Spelling Correction (CSC) task, we observe a discrepancy: while ChatGPT performs well under human evaluation, it scores poorly according to traditional metrics.
This paper proposes a new evaluation metric: Eval-GCSC. By incorporating word-level and semantic similarity judgments, it relaxes the stringent length and phonics constraints.
arXiv Detail & Related papers (2023-11-14T14:56:33Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction [4.60495447017298]
The evaluation values of the same error correction model can vary considerably under different word segmentation systems or different language models.
We propose three novel evaluation metrics for CGEC in two dimensions: reference-based and reference-less.
arXiv Detail & Related papers (2022-04-30T09:40:04Z)
- MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction [51.3754092853434]
MuCGEC is a multi-reference evaluation dataset for Chinese Grammatical Error Correction (CGEC).
It consists of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources.
Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
arXiv Detail & Related papers (2022-04-23T05:20:38Z)
- LM-Critic: Language Models for Unsupervised Grammatical Error Correction [128.9174409251852]
We show how to leverage a pretrained language model (LM) to define an LM-Critic, which judges whether a sentence is grammatical.
We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector.
arXiv Detail & Related papers (2021-09-14T17:06:43Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)