Grammar Accuracy Evaluation (GAE): Quantifiable Intrinsic Evaluation of
Machine Translation Models
- URL: http://arxiv.org/abs/2105.14277v2
- Date: Tue, 1 Jun 2021 10:07:30 GMT
- Title: Grammar Accuracy Evaluation (GAE): Quantifiable Intrinsic Evaluation of
Machine Translation Models
- Authors: Dojun Park, Youngjin Jang and Harksoo Kim
- Abstract summary: In this paper, we propose Grammar Accuracy Evaluation (GAE), which provides specific evaluation criteria.
Analyzing machine translation quality with both BLEU and GAE confirms that the BLEU score does not represent the absolute performance of machine translation models.
- Score: 3.007949058551534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human intrinsic evaluation of natural language generation models is
conducted because the quality of generated sentences cannot be fully captured
by extrinsic evaluation alone. Nevertheless, scores from existing intrinsic
evaluations vary widely depending on each evaluator's criteria. In this paper,
we propose Grammar Accuracy Evaluation (GAE), which provides specific
evaluation criteria. Analyzing machine translation quality with both BLEU and
GAE confirms that the BLEU score does not represent the absolute performance of
machine translation models, and that GAE compensates for the shortcomings of
BLEU through flexible evaluation of alternative synonyms and changes in
sentence structure.
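To make the BLEU limitation above concrete, here is a minimal sketch using NLTK's sentence-level BLEU. This is illustrative only: the sentence pair is invented rather than taken from the paper, and the paper's own experiments may use a different BLEU granularity.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference translation and two hypotheses: an exact match and an adequate
# paraphrase that swaps word choice and sentence structure.
reference  = "the cat sat on the mat".split()
exact      = "the cat sat on the mat".split()
paraphrase = "on the mat a cat was sitting".split()

# Smoothing avoids a zero score when a higher n-gram order has no matches.
smooth = SmoothingFunction().method1
print(sentence_bleu([reference], exact, smoothing_function=smooth))       # 1.0
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # ~0.18, despite being adequate

A criterion-based scheme such as GAE can credit the paraphrase as grammatical, meaning-preserving output, which pure n-gram overlap cannot.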
Related papers
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction [4.60495447017298]
The evaluation values of the same error correction model can vary considerably under different word segmentation systems or different language models.
We propose three novel evaluation metrics for Chinese Grammatical Error Correction (CGEC) in two dimensions: reference-based and reference-less.
arXiv Detail & Related papers (2022-04-30T09:40:04Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It contains only a limited number of commonly occurring error types and uses a scoring model that assigns error penalty points (EPPs) to each translation unit, with a geometric progression reflecting error severity (a hedged sketch of such scoring appears after this list).
The approach has several key advantages: it can measure and compare less-than-perfect MT output from different systems, indicate human perception of quality, immediately estimate the labor effort required to bring MT output to premium quality, and offer low-cost, fast application as well as higher inter-rater reliability (IRR).
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Perception Score, A Learned Metric for Open-ended Text Generation Evaluation [62.7690450616204]
We propose a novel and powerful learning-based evaluation metric: Perception Score.
The method measures the overall quality of the generation and scores it holistically, instead of focusing on a single evaluation criterion such as word overlap.
arXiv Detail & Related papers (2020-08-07T10:48:40Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
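As referenced in the HOPE entry above, a scoring model with geometrically increasing error penalty points might look like the sketch below. The severity levels, the ratio of the progression, and the per-unit aggregation are illustrative assumptions, not the paper's actual error typology or weights.

# Hypothetical sketch of HOPE-style scoring with geometric EPPs.
BASE = 2  # assumed ratio: an error of severity s costs BASE**s points

def epp(severity: int) -> int:
    """Penalty points for one error of the given severity level (0 = minor)."""
    return BASE ** severity

def hope_like_score(errors_per_unit: list[list[int]]) -> float:
    """Average penalty per translation unit; lower means better MT output.

    errors_per_unit[i] lists the severity of each error found in unit i.
    """
    penalties = [sum(epp(s) for s in unit) for unit in errors_per_unit]
    return sum(penalties) / len(errors_per_unit)

# Three translation units: clean, one minor error, one minor plus one severe.
print(hope_like_score([[], [0], [0, 3]]))  # (0 + 1 + 9) / 3 = 3.33...

Because penalties grow geometrically with severity, a single severe error outweighs several minor ones, which matches the framework's stated goal of reflecting error severity in the score.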