Related papers: Enhancing Human Evaluation in Machine Translation with Comparative Judgment

Enhancing Human Evaluation in Machine Translation with Comparative Judgment

URL: http://arxiv.org/abs/2502.17797v1
Date: Tue, 25 Feb 2025 03:02:24 GMT
Title: Enhancing Human Evaluation in Machine Translation with Comparative Judgment
Authors: Yixiao Song, Parker Riley, Daniel Deutsch, Markus Freitag,
Abstract summary: This study explores the integration of comparative judgment into human annotation for machine translation (MT)<n>It evaluates three annotation setups-point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version SxS relative ranking (RR)
Score: 28.793446713389844
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups-point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.

Related papers

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered by a reflective Core Agent that invokes specialized sub-agents.<n> Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation [22.41599031199308]
We experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM)<n>In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system.<n>We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
arXiv Detail & Related papers (2025-10-28T17:29:59Z)
MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation [9.331779458661831]
MAATS employs multiple specialized AI agents, each focused on a distinct MQM category.<n>It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs.<n>By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation.
arXiv Detail & Related papers (2025-05-20T19:29:05Z)
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.<n>We introduce a universal and training-free framework, $textbfMQM-APE, based on the idea of filtering out non-impactful errors.<n>Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation [48.080874541824436]
We introduce Error Span. ESA, a human evaluation protocol which combines the continuous rating of DA with the high-level. error severity span marking of MQM. ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
arXiv Detail & Related papers (2024-06-17T14:20:47Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean. We find that reference-free setup outperforms its counterpart in the style dimension. Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level. We propose a new prompting method called textbftextttError Analysis Prompting (EAPrompt) This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM) and textitproduces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages [6.049660810617423]
XLMRScore is a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task.
arXiv Detail & Related papers (2022-07-31T16:23:23Z)
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity. This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.