Construction of a Quality Estimation Dataset for Automatic Evaluation of
Japanese Grammatical Error Correction
- URL: http://arxiv.org/abs/2201.08038v1
- Date: Thu, 20 Jan 2022 08:07:42 GMT
- Title: Construction of a Quality Estimation Dataset for Automatic Evaluation of
Japanese Grammatical Error Correction
- Authors: Daisuke Suzuki, Yujin Takahashi, Ikumi Yamashita, Taichi Aida, Tosho
Hirasawa, Michitaka Nakatsuji, Masato Mita, Mamoru Komachi
- Abstract summary: In grammatical error correction (GEC), automatic evaluation is an important factor for research and development of GEC systems.
In this study, we created a quality estimation dataset with manual evaluation to build an automatic evaluation model for Japanese GEC.
- Score: 21.668187919351496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In grammatical error correction (GEC), automatic evaluation is an important
factor for research and development of GEC systems. Previous studies on
automatic evaluation have demonstrated that quality estimation models built
from datasets with manual evaluation can achieve high performance in automatic
evaluation of English GEC without using reference sentences. However, quality
estimation models have not yet been studied in Japanese, because there are no
datasets for constructing quality estimation models. Therefore, in this study,
we created a quality estimation dataset with manual evaluation to build an
automatic evaluation model for Japanese GEC. Moreover, we conducted a
meta-evaluation to verify the dataset's usefulness in building the Japanese
quality estimation model.
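
For readers unfamiliar with reference-free quality estimation, the sketch below illustrates one way such an automatic evaluation model could be built: a pretrained Japanese BERT encodes the (source, corrected sentence) pair and a regression head predicts a manual evaluation score. This is a minimal illustration only; the encoder name, score scale, and training objective are assumptions, not the authors' released implementation.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Assumed encoder choice (requires fugashi/unidic-lite for its tokenizer).
MODEL_NAME = "cl-tohoku/bert-base-japanese-v2"

class GECQualityEstimator(nn.Module):
    """Reference-free QE: encode a (source, correction) pair, regress a quality score."""
    def __init__(self, encoder_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]        # [CLS] vector of the sentence pair
        return self.regressor(cls).squeeze(-1)   # predicted quality score

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = GECQualityEstimator()

# Hypothetical training example: learner sentence, system correction, manual score.
source = "私は昨日学校に行くました。"
correction = "私は昨日学校に行きました。"
human_score = torch.tensor([4.0])  # assumed 0-4 manual evaluation scale

batch = tokenizer(source, correction, return_tensors="pt", truncation=True)
loss = nn.functional.mse_loss(model(**batch), human_score)  # MSE regression objective
loss.backward()
```

In a meta-evaluation of the kind the abstract describes, the predictions of such a model would then be correlated (e.g., Pearson or Spearman coefficients) with held-out human scores.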
Related papers
- Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation [14.405862891194344]
We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors.
Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output.
We propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones.
arXiv Detail & Related papers (2024-04-27T23:52:51Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all the VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Evaluating the Generation Capabilities of Large Chinese Language Models [27.598864484231477]
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
arXiv Detail & Related papers (2023-08-09T09:22:56Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Proficiency Matters Quality Estimation in Grammatical Error Correction [30.31557952622774]
This study investigates how supervised quality estimation (QE) models for grammatical error correction (GEC) are affected by the proficiency of the learners who produced the data.
arXiv Detail & Related papers (2022-01-17T03:47:19Z)
- Quality Estimation without Human-labeled Data [25.25993509174361]
Quality estimation aims to measure the quality of translated content without access to a reference translation.
We propose a technique that does not rely on examples from human-annotators and instead uses synthetic training data.
We train off-the-shelf architectures for supervised quality estimation on our synthetic data and show that the resulting models achieve comparable performance to models trained on human-annotated data.
arXiv Detail & Related papers (2021-02-08T06:25:46Z)
- Data Quality Evaluation using Probability Models [0.0]
It is shown that, for the data examined, data quality can be predicted accurately from simple good/bad pre-labelled learning examples.
arXiv Detail & Related papers (2020-09-14T18:12:19Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.