Large Language Models Are State-of-the-Art Evaluators of Translation Quality
- URL: http://arxiv.org/abs/2302.14520v2
- Date: Wed, 31 May 2023 19:51:51 GMT
- Title: Large Language Models Are State-of-the-Art Evaluators of Translation Quality
- Authors: Tom Kocmi and Christian Federmann
- Abstract summary: GEMBA is a GPT-based metric for assessment of translation quality.
We investigate nine versions of GPT models, including ChatGPT and GPT-4.
Our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels.
- Score: 7.818228526742237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe GEMBA, a GPT-based metric for assessment of translation quality,
which works both with a reference translation and without. In our evaluation,
we focus on zero-shot prompting, comparing four prompt variants in two modes,
based on the availability of the reference. We investigate nine versions of GPT
models, including ChatGPT and GPT-4. We show that our method for translation
quality assessment only works with GPT-3.5 and larger models. Compared to
results from the WMT22 Metrics shared task, our method achieves state-of-the-art
accuracy in both modes when compared to MQM-based human labels. Our results are
valid on the system level for all three WMT22 Metrics shared task language
pairs, namely English into German, English into Russian, and Chinese into
English. This provides a first glimpse into the usefulness of pre-trained,
generative large language models for quality assessment of translations. We
publicly release all our code and prompt templates used for the experiments
described in this work, as well as all corresponding scoring results, to allow
for external validation and reproducibility.
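To make the setup above concrete, here is a minimal sketch of what zero-shot, reference-optional prompting for direct quality scoring might look like. The prompt wording, the `build_prompt` and `gemba_like_score` helpers, and the generic `llm` callable are illustrative assumptions, not the authors' released templates; the publicly released code and prompts should be used for reproduction.

```python
# A minimal sketch of zero-shot, GEMBA-style translation scoring.
# The prompt wording, helper names, and the generic `llm` callable are
# illustrative assumptions; the authors' released templates should be
# preferred for reproduction.
import re
from typing import Callable, Optional


def build_prompt(source_lang: str, target_lang: str, source: str,
                 hypothesis: str, reference: Optional[str] = None) -> str:
    """Compose a zero-shot direct-assessment prompt.

    The reference line is included only in the reference-based mode;
    omitting it yields the reference-free (quality-estimation) mode.
    """
    lines = [
        f"Score the following translation from {source_lang} to {target_lang} "
        "on a continuous scale from 0 to 100, where 0 means "
        '"no meaning preserved" and 100 means "perfect meaning and grammar".',
        f'{source_lang} source: "{source}"',
    ]
    if reference is not None:
        lines.append(f'{target_lang} human reference: "{reference}"')
    lines.append(f'{target_lang} translation: "{hypothesis}"')
    lines.append("Score:")
    return "\n".join(lines)


def gemba_like_score(llm: Callable[[str], str], **segment) -> Optional[float]:
    """Query any string-in/string-out LLM and parse the first number in
    its reply as the segment-level score (None if no number is found)."""
    reply = llm(build_prompt(**segment))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None
```

System-level scores can then be aggregated from segment-level outputs, for example by averaging over a test set, before comparison against MQM-based human labels.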
Related papers
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4 [20.13049408028925]
This paper introduces GEMBA-MQM, a GPT-based evaluation metric to detect translation quality errors.
GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans (a rough sketch of this prompting style appears after this list).
Preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking.
arXiv Detail & Related papers (2023-10-21T12:30:33Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation [16.90012234231392]
We show that GPT models achieve very competitive translation quality for high resource languages.
We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality.
arXiv Detail & Related papers (2023-02-18T02:11:36Z)
- Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present UniTE, our submission to the sentence-level MQM benchmark of the Quality Estimation Shared Task.
Specifically, our systems employ the UniTE framework, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
- UniTE: Unified Translation Evaluation [63.58868113074476]
UniTE is the first unified framework able to handle all three evaluation tasks.
We test our framework on the WMT 2019 Metrics and WMT 2020 Quality Estimation benchmarks.
arXiv Detail & Related papers (2022-04-28T08:35:26Z)
- Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task [30.889496911261677]
This paper describes our contribution to the WMT 2020 Metrics Shared Task.
We make several submissions based on BLEURT, a metric based on transfer learning.
We show how to combine BLEURT's predictions with those of YiSi and use alternative reference translations to enhance the performance.
arXiv Detail & Related papers (2020-10-08T23:16:26Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
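As noted in the GEMBA-MQM entry above, error-span prompting differs from direct scoring: the model is shown a fixed set of few-shot examples, asked to mark MQM-style error spans with severities, and the severities are then converted into a score. The sketch below is only an approximation under assumed severity labels and weights (1/5/25 for minor/major/critical); the fixed three-shot examples, exact prompt wording, and scoring used by GEMBA-MQM are defined in that paper and its released code.

```python
# An approximate, illustrative sketch of GEMBA-MQM-style scoring.
# The demonstration examples, prompt wording, severity labels, and the
# 1/5/25 weights are assumptions; consult the GEMBA-MQM paper and code
# for the authors' fixed three-shot prompt and exact scoring.
from typing import Callable, Dict, List, Tuple

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}


def build_error_span_prompt(source: str, hypothesis: str,
                            demos: List[Dict[str, str]]) -> str:
    """Build a few-shot prompt asking the model to list MQM-style errors,
    one per line, formatted as '<severity>: <error span>'."""
    parts = ["Identify translation errors in the candidate translation. "
             "List one error per line as '<severity>: <error span>', where "
             "severity is minor, major, or critical. Write 'no errors' if "
             "the translation is correct."]
    for demo in demos:  # fixed demonstrations, independent of the test pair
        parts.append(f"Source: {demo['source']}\n"
                     f"Translation: {demo['hypothesis']}\n"
                     f"Errors:\n{demo['errors']}")
    parts.append(f"Source: {source}\nTranslation: {hypothesis}\nErrors:")
    return "\n\n".join(parts)


def parse_errors(reply: str) -> List[Tuple[str, str]]:
    """Parse '<severity>: <span>' lines from the model reply."""
    errors = []
    for line in reply.splitlines():
        if ":" in line:
            severity, span = line.split(":", 1)
            severity = severity.strip().lower()
            if severity in SEVERITY_WEIGHTS:
                errors.append((severity, span.strip()))
    return errors


def mqm_like_score(llm: Callable[[str], str], source: str, hypothesis: str,
                   demos: List[Dict[str, str]]) -> float:
    """Return a negative penalty: the weighted sum of detected errors."""
    errors = parse_errors(llm(build_error_span_prompt(source, hypothesis, demos)))
    return -sum(SEVERITY_WEIGHTS[sev] for sev, _ in errors)
```

The negative weighted penalty plays the same role as a segment-level score and can be averaged per system for ranking.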