Proficiency Matters Quality Estimation in Grammatical Error Correction
- URL: http://arxiv.org/abs/2201.06199v1
- Date: Mon, 17 Jan 2022 03:47:19 GMT
- Title: Proficiency Matters Quality Estimation in Grammatical Error Correction
- Authors: Yujin Takahashi, Masahiro Kaneko, Masato Mita, Mamoru Komachi
- Abstract summary: This study investigates how supervised quality estimation (QE) models of grammatical error correction (GEC) are affected by the proficiency of the learners who produced the data.
- Score: 30.31557952622774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates how supervised quality estimation (QE)
models of grammatical error correction (GEC) are affected by the proficiency
of the learners who produced the data. QE models for GEC evaluation in prior
work have achieved high correlations with manual evaluation. However, the data
behind these reported results are of limited use in real-world settings,
because prior work was biased toward data written by learners with relatively
high proficiency levels. To address this issue, we created a QE dataset that
covers multiple proficiency levels and explored the necessity of
proficiency-wise evaluation for QE of GEC. Our experiments demonstrated that
differences in evaluation dataset proficiency affect the performance of QE
models, and that proficiency-wise evaluation helps create more robust models.
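In practical terms, proficiency-wise evaluation means scoring a QE model separately on each proficiency group of the evaluation data instead of on the pooled set. Below is a minimal sketch of that idea; the field names, level labels, and the qe_model interface are illustrative assumptions rather than the paper's actual dataset format or model.

```python
# Minimal sketch of proficiency-wise evaluation for a GEC QE model.
# The example fields ("proficiency", "manual_score") and the qe_model.score()
# interface are hypothetical, not the dataset or model from the paper.
from collections import defaultdict
from scipy.stats import pearsonr

def proficiency_wise_evaluation(examples, qe_model):
    """Return the QE-vs-manual correlation separately for each proficiency level."""
    by_level = defaultdict(list)
    for ex in examples:
        # ex is a dict such as:
        # {"source": ..., "hypothesis": ..., "manual_score": 3.5, "proficiency": "B1"}
        predicted = qe_model.score(ex["source"], ex["hypothesis"])
        by_level[ex["proficiency"]].append((predicted, ex["manual_score"]))

    correlations = {}
    for level, pairs in sorted(by_level.items()):
        preds, golds = zip(*pairs)
        r, _ = pearsonr(preds, golds)  # correlation within a single proficiency level
        correlations[level] = r
    return correlations
```

A single correlation computed over all proficiency levels at once can hide exactly the per-level differences the study is concerned with.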
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Image Quality Assessment: Integrating Model-Centric and Data-Centric Approaches [20.931709027443706]
Learning-based image quality assessment (IQA) has made remarkable progress in the past decade.
Nearly all existing methods, however, consider the two key components -- model and data -- in isolation.
arXiv Detail & Related papers (2022-07-29T16:23:57Z)
- Construction of a Quality Estimation Dataset for Automatic Evaluation of Japanese Grammatical Error Correction [21.668187919351496]
In grammatical error correction (GEC), automatic evaluation is an important factor for research and development of GEC systems.
In this study, we created a quality estimation dataset with manual evaluation to build an automatic evaluation model for Japanese GEC.
arXiv Detail & Related papers (2022-01-20T08:07:42Z)
- Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation [25.325624543852086]
We propose a general methodology for adversarial testing of Quality Estimation for Machine Translation (MT) systems.
We show that despite the high correlation with human judgements achieved by recent SOTA models, certain types of meaning errors are still problematic for QE to detect.
Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance.
arXiv Detail & Related papers (2021-09-22T17:32:18Z)
- Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications [29.380675447523817]
Sentence-level quality estimation (QE) of machine translation is traditionally formulated as a regression task.
Recent QE models have achieved previously-unseen levels of correlation with human judgments.
We evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting.
arXiv Detail & Related papers (2021-09-17T16:14:52Z)
- Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction [98.31440090585376]
Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills.
Existing GEC models tend to produce spurious corrections or fail to detect many errors.
This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses.
arXiv Detail & Related papers (2021-05-10T15:04:25Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
However, these datasets contain a non-negligible amount of "noise" in which errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
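As a rough illustration of the prediction-consistency idea in the last entry above (a guess at the general shape, not the authors' actual procedure), one could flag training pairs whose reference correction is almost never reproduced by a set of existing GEC models. The model interface below is hypothetical.

```python
# Toy sketch of consistency-based denoising of GEC training data.
# A training pair is flagged as potentially noisy when almost none of the
# existing models reproduce its reference correction. The .correct() interface
# is a hypothetical stand-in for whatever GEC systems are available.
def flag_noisy_pairs(pairs, gec_models, max_agreement=0.2):
    """Return indices of (source, reference) pairs that the models rarely reproduce."""
    noisy = []
    for i, (source, reference) in enumerate(pairs):
        predictions = [model.correct(source) for model in gec_models]
        agreement = sum(pred == reference for pred in predictions) / len(predictions)
        if agreement <= max_agreement:
            noisy.append(i)
    return noisy
```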
This list is automatically generated from the titles and abstracts of the papers on this site.