Related papers: MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

URL: http://arxiv.org/abs/2409.14335v2
Date: Mon, 16 Dec 2024 08:08:51 GMT
Title: MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
Authors: Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao,
Abstract summary: Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.<n>We introduce a universal and training-free framework, $textbfMQM-APE, based on the idea of filtering out non-impactful errors.<n>Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
Score: 53.91199933655421
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment, providing both scores and fine-grained feedback. Although approaches such as GEMBA-MQM have shown state-of-the-art performance on reference-free evaluation, the predicted errors do not align well with those annotated by human, limiting their interpretability as feedback signals. To enhance the quality of error annotations predicted by LLM evaluators, we introduce a universal and training-free framework, $\textbf{MQM-APE}$, based on the idea of filtering out non-impactful errors by Automatically Post-Editing (APE) the original translation based on each error, leaving only those errors that contribute to quality improvement. Specifically, we prompt the LLM to act as 1) $\textit{evaluator}$ to provide error annotations, 2) $\textit{post-editor}$ to determine whether errors impact quality improvement and 3) $\textit{pairwise quality verifier}$ as the error filter. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM, across eight LLMs in both high- and low-resource languages. Orthogonal to trained approaches, MQM-APE complements translation-specific evaluators such as Tower, highlighting its broad applicability. Further analysis confirms the effectiveness of each module and offers valuable insights into evaluator design and LLMs selection.

Related papers

Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation [10.050982803590903]
Quality Estimation aims to assess the quality of machine translation (MT) outputs without relying on reference translations.<n>We introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair.<n>We introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters.
arXiv Detail & Related papers (2026-02-09T12:42:41Z)
HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [38.67031685302134]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation.<n>We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors.<n> Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z)
Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models [6.525298236457623]
Large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations.<n>We show that evaluation should be invariant to text length, producing consistent error spans regardless of input granularity.<n>We evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP) and a fine-tuning approach to better align LLMs with the evaluation task.
arXiv Detail & Related papers (2025-05-03T09:30:26Z)
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages [9.138590152838754]
Segment-level quality estimation (QE) is a challenging cross-lingual language understanding task. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models.
arXiv Detail & Related papers (2025-01-08T12:54:05Z)
Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations [14.149224539732913]
Machine Translation remains one of the last NLP tasks where large language models (LLMs) have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality. Experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation.
arXiv Detail & Related papers (2024-04-11T15:47:10Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages. Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs. In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z)
TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement [26.26493253161022]
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT) We introduce a systematic LLM-based self-refinement translation framework, named textbfTEaR.
arXiv Detail & Related papers (2024-02-26T07:58:12Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLM) are leveraging human feedback to improve their generation quality. We propose LLMRefine, an inference time optimization method to refine LLM's output. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches. It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level. We propose a new prompting method called textbftextttError Analysis Prompting (EAPrompt) This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM) and textitproduces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, emphHJQE, where the expert translators directly annotate poorly translated words. We propose two tag correcting strategies, namely tag refinement strategy and tree-based annotation strategy, to make the TER-based artificial QE corpus closer to emphHJQE. The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output. It contains only a limited number of commonly occurring error types, and use a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR.
arXiv Detail & Related papers (2021-12-27T18:47:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.