GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
- URL: http://arxiv.org/abs/2310.13988v1
- Date: Sat, 21 Oct 2023 12:30:33 GMT
- Title: GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
- Authors: Tom Kocmi and Christian Federmann
- Abstract summary: This paper introduces GEMBA-MQM, a GPT-based evaluation metric to detect translation quality errors.
GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans.
Preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking.
- Score: 20.13049408028925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to
detect translation quality errors, specifically for the quality estimation
setting without the need for human reference translations. Based on the power
of large language models (LLM), GEMBA-MQM employs a fixed three-shot prompting
technique, querying the GPT-4 model to mark error quality spans. Compared to
previous works, our method has language-agnostic prompts, thus avoiding the
need for manual prompt preparation for new languages.
While preliminary results indicate that GEMBA-MQM achieves state-of-the-art
accuracy for system ranking, we advise caution when using it in academic works
to demonstrate improvements over other methods due to its dependence on the
proprietary, black-box GPT model.
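As an illustration of the prompting setup described above, the sketch below sends a fixed few-shot prompt to GPT-4 and asks it to mark MQM error spans for a single segment. The prompt wording, the few-shot example, and the severity weights mentioned in the comments are assumptions for illustration only; they are not the exact template published with GEMBA-MQM, and the `OpenAI` client call simply stands in for whichever GPT-4 API access is available.

```python
# Illustrative sketch of a GEMBA-MQM-style query (assumed prompt wording, not the
# authors' exact template). It sends a fixed few-shot prompt to GPT-4 and asks the
# model to mark MQM error spans in a candidate translation.
from openai import OpenAI  # assumes the openai>=1.0 Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are an annotator for the quality of machine translation. "
    "Your task is to identify errors and assess the quality of the translation."
)

# One fixed few-shot example (the paper uses three); the texts below are made up.
FEW_SHOT = [
    {
        "role": "user",
        "content": (
            "English source: I have a dog.\n"
            "German translation: Ich habe eine Katze.\n"
            "Based on the source segment and translation, identify error types in "
            "the translation and classify them as critical, major, or minor."
        ),
    },
    {
        "role": "assistant",
        "content": 'Critical:\nno-error\nMajor:\naccuracy/mistranslation - "Katze"\nMinor:\nno-error',
    },
]

def gemba_mqm_style_query(source: str, hypothesis: str, src_lang: str, tgt_lang: str) -> str:
    """Ask GPT-4 to mark error spans for one segment; returns the raw model output."""
    user_msg = (
        f"{src_lang} source: {source}\n"
        f"{tgt_lang} translation: {hypothesis}\n"
        "Based on the source segment and translation, identify error types in "
        "the translation and classify them as critical, major, or minor."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic annotation
        messages=[{"role": "system", "content": SYSTEM}, *FEW_SHOT,
                  {"role": "user", "content": user_msg}],
    )
    return response.choices[0].message.content

# The returned error list can then be mapped to a numeric segment score with the
# usual MQM-style severity weights (e.g. minor = 1, major = 5 penalty points).
```

The resulting error annotations would then be aggregated into segment-level scores and averaged over a test set to rank systems; the exact prompt text, few-shot examples, and weighting are defined in the paper itself.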
Related papers
- QE-EBM: Using Quality Estimators as Energy Loss for Machine Translation [5.10832476049103]
We propose QE-EBM, a method of employing quality estimators as trainable loss networks that can directly backpropagate to the NMT model.
We examine our method on several low and high resource target languages with English as the source language.
arXiv Detail & Related papers (2024-10-14T07:39:33Z)
- MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.
We introduce a universal and training-free framework, MQM-APE, to enhance the quality of error annotations predicted by LLM evaluators.
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
- Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation [48.080874541824436]
We introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM.
ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
arXiv Detail & Related papers (2024-06-17T14:20:47Z)
- Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its reference-based counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z)
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality [7.818228526742237]
GEMBA is a GPT-based metric for assessment of translation quality.
We investigate nine versions of GPT models, including ChatGPT and GPT-4.
Our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels.
arXiv Detail & Related papers (2023-02-28T12:23:48Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work estimates confidence intervals (Brown et al., 2001) for translation quality evaluation as a function of the sample size of the translated text.
The methodology applied combines Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA); a generic sketch of this kind of interval estimate follows below.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
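As a generic illustration of the sample-size question raised in the last entry above, and not that paper's exact methodology, the sketch below computes a normal-approximation (Wald) confidence interval for an observed per-segment error rate and uses Monte Carlo sampling from a Bernoulli distribution to check how often the interval actually covers the true rate.

```python
# Generic illustration (not the cited paper's method): how wide is the confidence
# interval for an observed error rate when only n translated segments are sampled?
import math
import random

def normal_approx_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) interval for a Bernoulli proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

def monte_carlo_coverage(p_true: float, n: int, trials: int = 10_000) -> float:
    """Fraction of simulated samples whose Wald interval actually contains p_true."""
    covered = 0
    for _ in range(trials):
        errors = sum(random.random() < p_true for _ in range(n))
        low, high = normal_approx_ci(errors / n, n)
        covered += low <= p_true <= high
    return covered / trials

if __name__ == "__main__":
    for n in (50, 200, 1000):
        low, high = normal_approx_ci(0.10, n)
        print(f"n={n:5d}  observed 10% error rate -> 95% CI ({low:.3f}, {high:.3f}), "
              f"MC coverage ~ {monte_carlo_coverage(0.10, n):.3f}")
```

The interval width shrinks roughly with 1/sqrt(n), which is why quality estimates obtained from small translated samples should be reported together with their uncertainty.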