MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
- URL: http://arxiv.org/abs/2510.24664v1
- Date: Tue, 28 Oct 2025 17:29:59 GMT
- Title: MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
- Authors: Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag
- Abstract summary: We experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM). In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
- Score: 22.41599031199308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
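Read as a protocol, the setup described in the abstract is simple: gather first-pass MQM error spans, hand them to a second-stage rater who may delete, edit, or (most importantly) add spans, and re-score the segment. The Python sketch below illustrates that data flow; the class, function names, and severity weights are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Severity weights in the style of common MQM scoring schemes; the exact
# weights used in the paper are an assumption here.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

@dataclass(frozen=True)
class MQMError:
    """One annotated error span in a translated segment."""
    start: int      # character offset where the error span begins
    end: int        # character offset where the error span ends
    category: str   # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str   # "minor" or "major"

def segment_score(errors: List[MQMError]) -> float:
    """Sum of severity weights; higher means a worse translation."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

def re_annotate(first_pass: List[MQMError],
                deletions: List[MQMError],
                edits: Dict[MQMError, MQMError],
                additions: List[MQMError]) -> List[MQMError]:
    """Second stage: a rater reviews pre-existing annotations (their own,
    another rater's, or an automatic system's) and may delete, edit, or add
    error spans. The paper finds that most of the quality gain comes from
    `additions`, i.e. errors missed during the first pass."""
    kept = [edits.get(e, e) for e in first_pass if e not in deletions]
    return kept + additions
```

A re-annotated segment can then be passed back through segment_score to quantify how much the second pass changed the segment-level MQM score.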
Related papers
- Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework [78.58395822978271]
LEAF is a Label-Efficient Image Quality Assessment Framework. It distills perceptual quality priors from an MLLM teacher into a lightweight student regressor. Our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations.
arXiv Detail & Related papers (2026-01-28T15:15:17Z) - HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [39.7293877954587]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z) - MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment. We introduce a universal and training-free framework, MQM-APE, based on the idea of filtering out non-impactful errors. Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z) - Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation [48.080874541824436]
We introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM.
ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
arXiv Detail & Related papers (2024-06-17T14:20:47Z) - Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its reference-based counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z) - Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores (a minimal sketch of this penalty appears after this list).
arXiv Detail & Related papers (2024-01-23T16:07:43Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Practical Perspectives on Quality Estimation for Machine Translation [6.400178956011897]
Sentence level quality estimation (QE) for machine translation (MT) attempts to predict the translation edit rate (TER) cost of post-editing work required to correct MT output.
We find that consumers of MT output are primarily interested in a binary quality metric: is the translated sentence adequate as-is, or does it need post-editing?
We demonstrate that, while classical QE regression models fare poorly on this task, they can be re-purposed by replacing the output regression layer with a binary classification one, achieving 50-60% recall at 90% precision (a sketch of this re-purposing follows this list).
arXiv Detail & Related papers (2020-05-02T01:50:10Z)
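For the "Improving Machine Translation with Human Feedback" entry above, the rule-based penalty can be read as a simple wrapper around the QE reward. The sketch below is one plausible reading of that idea; the specific rules, penalty value, and function name are assumptions, not the paper's implementation.

```python
def penalized_reward(source: str, translation: str, qe_score: float,
                     penalty: float = 1.0) -> float:
    """Reward for QE-based feedback training: the QE model's score, minus a
    fixed penalty whenever simple rules flag the translation as incorrect.
    The rules below (empty output, untranslated copy of the source, heavily
    repetitive output) are illustrative stand-ins for the paper's rules."""
    def looks_incorrect() -> bool:
        stripped = translation.strip()
        if not stripped:
            return True                      # empty output
        if stripped == source.strip():
            return True                      # source copied through untranslated
        tokens = stripped.split()
        if tokens and len(set(tokens)) / len(tokens) < 0.3:
            return True                      # degenerate, repetitive output
        return False

    return qe_score - penalty if looks_incorrect() else qe_score
```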
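The "Practical Perspectives on Quality Estimation" entry above re-purposes a QE regressor as a binary adequate-vs-needs-post-editing classifier and then operates it at a high-precision threshold. The sketch below illustrates both steps under stated assumptions (a model exposing a `head` layer, logits scored on held-out data); it is not the paper's released code.

```python
import numpy as np
import torch.nn as nn

def replace_regression_head(qe_model: nn.Module, hidden_size: int) -> nn.Module:
    """Swap the QE regressor's output layer (originally trained to predict a
    TER-like score) for a fresh single-logit head, to be fine-tuned with
    binary cross-entropy on adequate-as-is vs. needs-post-editing labels.
    Assumes the model exposes its final layer as `qe_model.head` (hypothetical)."""
    qe_model.head = nn.Linear(hidden_size, 1)
    return qe_model

def threshold_for_precision(logits: np.ndarray, labels: np.ndarray,
                            target_precision: float = 0.9) -> float:
    """Pick an operating point on held-out data: the smallest threshold at
    which predicting 'needs post-editing' for logits >= threshold reaches the
    target precision (the paper reports 50-60% recall at 90% precision there).
    `labels` is 1 when the sentence truly needs post-editing, else 0."""
    candidates = [t for t in np.unique(logits)
                  if labels[logits >= t].mean() >= target_precision]
    return float(min(candidates)) if candidates else float("inf")
```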