MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
- URL: http://arxiv.org/abs/2510.24664v1
- Date: Tue, 28 Oct 2025 17:29:59 GMT
- Title: MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
- Authors: Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag
- Abstract summary: We experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM). In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
- Score: 22.41599031199308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
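Read as a protocol, the setup described in the abstract is simple: gather first-pass MQM error spans, hand them to a second-stage rater who may delete, edit, or (most importantly) add spans, and re-score the segment. The Python sketch below illustrates that data flow; the class, function names, and severity weights are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Severity weights in the style of common MQM scoring schemes; the exact
# weights used in the paper are an assumption here.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

@dataclass(frozen=True)
class MQMError:
    """One annotated error span in a translated segment."""
    start: int      # character offset where the error span begins
    end: int        # character offset where the error span ends
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor" or "major"

def segment_score(errors: List[MQMError]) -> float:
    """Sum of severity weights; higher means a worse translation."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

def re_annotate(first_pass: List[MQMError],
                deletions: List[MQMError],
                edits: Dict[MQMError, MQMError],
                additions: List[MQMError]) -> List[MQMError]:
    """Second stage: a rater reviews pre-existing annotations (their own,
    another rater's, or an automatic system's) and may delete, edit, or add
    error spans. The paper finds that most of the quality gain comes from
    `additions`, i.e. errors missed during the first pass."""
    kept = [edits.get(e, e) for e in first_pass if e not in deletions]
    return kept + additions
```

A re-annotated segment can then be passed back through segment_score to quantify how much the second pass changed the segment-level MQM score.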
Related papers
- Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework [78.58395822978271]
LEAF is a Label-Efficient Image Quality Assessment Framework. It distills perceptual quality priors from an MLLM teacher into a lightweight student regressor. Our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations.
arXiv Detail & Related papers (2026-01-28T15:15:17Z) - HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [39.7293877954587]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z) - MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment. We introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z) - Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation [48.080874541824436]
We introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM.
ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
arXiv Detail & Related papers (2024-06-17T14:20:47Z) - Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its reference-based counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z) - Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores (a minimal sketch of this penalty appears after this list).
arXiv Detail & Related papers (2024-01-23T16:07:43Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Practical Perspectives on Quality Estimation for Machine Translation [6.400178956011897]
Sentence level quality estimation (QE) for machine translation (MT) attempts to predict the translation edit rate (TER) cost of post-editing work required to correct MT output.
We find that consumers of MT output are primarily interested in a binary quality metric: is the translated sentence adequate as-is, or does it need post-editing?
We demonstrate that, while classical QE regression models fare poorly on this task, they can be re-purposed by replacing the output regression layer with a binary classification one, achieving 50-60% recall at 90% precision (a sketch of this re-purposing follows this list).
arXiv Detail & Related papers (2020-05-02T01:50:10Z)
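For the "Improving Machine Translation with Human Feedback" entry above, the rule-based penalty can be read as a simple wrapper around the QE reward. The sketch below is one plausible reading of that idea; the specific rules, penalty value, and function name are assumptions, not the paper's implementation.

```python
def penalized_reward(source: str, translation: str, qe_score: float,
                     penalty: float = 1.0) -> float:
    """Reward for QE-based feedback training: the QE model's score, minus a
    fixed penalty whenever simple rules flag the translation as incorrect.
    The rules below (empty output, untranslated copy of the source, heavily
    repetitive output) are illustrative stand-ins for the paper's rules."""
    def looks_incorrect() -> bool:
        stripped = translation.strip()
        if not stripped:
            return True                      # empty output
        if stripped == source.strip():
            return True                      # source copied through untranslated
        tokens = stripped.split()
        if tokens and len(set(tokens)) / len(tokens) < 0.3:
            return True                      # degenerate, repetitive output
        return False

    return qe_score - penalty if looks_incorrect() else qe_score
```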
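The "Practical Perspectives on Quality Estimation" entry above re-purposes a QE regressor as a binary adequate-vs-needs-post-editing classifier and then operates it at a high-precision threshold. The sketch below illustrates both steps under stated assumptions (a model exposing a `head` layer, logits scored on held-out data); it is not the paper's released code.

```python
import numpy as np
import torch.nn as nn

def replace_regression_head(qe_model: nn.Module, hidden_size: int) -> nn.Module:
    """Swap the QE regressor's output layer (originally trained to predict a
    TER-like score) for a fresh single-logit head, to be fine-tuned with
    binary cross-entropy on adequate-as-is vs. needs-post-editing labels.
    Assumes the model exposes its final layer as `qe_model.head` (hypothetical)."""
    qe_model.head = nn.Linear(hidden_size, 1)
    return qe_model

def threshold_for_precision(logits: np.ndarray, labels: np.ndarray,
                            target_precision: float = 0.9) -> float:
    """Pick an operating point on held-out data: the smallest threshold at
    which predicting 'needs post-editing' for logits >= threshold reaches the
    target precision (the paper reports 50-60% recall at 90% precision there).
    `labels` is 1 when the sentence truly needs post-editing, else 0."""
    candidates = [t for t in np.unique(logits)
                  if labels[logits >= t].mean() >= target_precision]
    return float(min(candidates)) if candidates else float("inf")
```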