A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction
- URL: http://arxiv.org/abs/2205.00217v1
- Date: Sat, 30 Apr 2022 09:40:04 GMT
- Title: A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction
- Authors: Nankai Lin, Xiaotian Lin, Ziyu Yang, Shengyi Jiang
- Abstract summary: The evaluation values of the same error correction model can vary considerably under different word segmentation systems or different language models.
We propose three novel evaluation metrics for CGEC in two dimensions: reference-based and reference-less.
- Score: 4.60495447017298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a fundamental task in natural language processing, Chinese Grammatical
Error Correction (CGEC) has gradually received widespread attention and become
a research hotspot. However, one obvious deficiency of the existing CGEC
evaluation system is that the evaluation values are significantly influenced by
the Chinese word segmentation results or different language models. The
evaluation values of the same error correction model can vary considerably
under different word segmentation systems or different language models.
However, these metrics are expected to be independent of the word
segmentation results and language models, since such dependence leads to a lack
of uniqueness and comparability when evaluating different methods. To this
end, we propose three novel evaluation metrics for CGEC in two dimensions:
reference-based and reference-less. For the reference-based dimension, we
introduce sentence-level accuracy and char-level BLEU to evaluate the corrected
sentences. For the reference-less dimension, we adopt char-level meaning
preservation to measure how well the corrected sentence preserves the meaning
of the original. We thoroughly evaluate and analyze the reasonableness and
validity of the three proposed metrics, and we expect them to become a new
standard for CGEC.
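Below is a minimal, hedged sketch of how the two reference-based metrics named in the abstract (sentence-level accuracy and char-level BLEU) could be computed. Because n-grams are taken over characters rather than segmented words, the scores do not depend on any word segmentation system or external language model. This is not the authors' implementation: the function names, the add-one smoothing of n-gram precisions, and the brevity-penalty details are illustrative assumptions.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Counter of character n-grams of a sentence (no word segmentation needed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_bleu(hypothesis, references, max_n=4):
    """Char-level BLEU of one corrected sentence against its references.

    Assumption: n-gram precisions use add-one smoothing; the paper's exact
    smoothing and weighting scheme may differ.
    """
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = char_ngrams(hypothesis, n)
        # Clip each hypothesis n-gram by its maximum count across references.
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in char_ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        overlap = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in hyp_counts.items())
        total = sum(hyp_counts.values())
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty against the reference length (in characters) closest to the hypothesis.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references), key=lambda L: (abs(L - hyp_len), L))
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def sentence_accuracy(hypotheses, reference_lists):
    """Fraction of corrected sentences that exactly match at least one reference."""
    hits = sum(hyp in refs for hyp, refs in zip(hypotheses, reference_lists))
    return hits / len(hypotheses)


if __name__ == "__main__":
    hyps = ["我喜欢学习中文。"]
    refs = [["我喜欢学习中文。", "我喜欢学中文。"]]
    print("sentence-level accuracy:", sentence_accuracy(hyps, refs))
    print("char-level BLEU:", round(char_bleu(hyps[0], refs[0]), 4))
```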
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Revisiting Meta-evaluation for Grammatical Error Correction [14.822205658480813]
SEEDA is a new dataset for GEC meta-evaluation.
It consists of corrections with human ratings along two different granularities.
The results suggest that edit-based metrics may have been underestimated in existing studies.
arXiv Detail & Related papers (2024-03-05T05:53:09Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction [32.44051877804761]
Chunk-LEvel Multi-reference Evaluation (CLEME) is designed to evaluate Grammatical Error Correction (GEC) systems in the multi-reference evaluation setting.
We conduct experiments on six English reference sets based on the CoNLL-2014 shared task.
arXiv Detail & Related papers (2023-05-18T08:57:17Z)
- DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z)
- MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction [51.3754092853434]
MuCGEC is a multi-reference, multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC).
It consists of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources.
Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.
arXiv Detail & Related papers (2022-04-23T05:20:38Z)
- Grammar Accuracy Evaluation (GAE): Quantifiable Intrinsic Evaluation of Machine Translation Models [3.007949058551534]
In this paper, we propose Grammar Accuracy Evaluation (GAE) that can provide specific evaluating criteria.
Analyzing machine translation quality with both BLEU and GAE confirms that the BLEU score does not reflect the absolute performance of machine translation models.
arXiv Detail & Related papers (2021-05-29T11:40:51Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.