Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
- URL: http://arxiv.org/abs/2504.13630v1
- Date: Fri, 18 Apr 2025 11:11:14 GMT
- Title: Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
- Authors: Shaomu Tan, Christof Monz,
- Abstract summary: We propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task.<n>Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data.<n>In extensive experiments across WMT22-24 shared tasks, ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation.
- Score: 4.548755617115687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
Related papers
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality.
We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z) - How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs [23.247387152595067]
LITEVAL-CORPUS is a parallel corpus containing verified human translations and outputs from 9 literary machine translation systems.
We examine the consistency and adequacy of human evaluation schemes with various degrees of complexity.
Our overall evaluation indicates that published human translations consistently outperform LLM translations.
arXiv Detail & Related papers (2024-10-24T12:48:03Z) - Large Language Models "Ad Referendum": How Good Are They at Machine
Translation in the Legal Domain? [0.0]
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a tradition-al neural machine translation (NMT) system across four language pairs in the legal domain.
It combines automatic evaluation met-rics (AEMs) and human evaluation (HE) by professional transla-tors to assess translation ranking, fluency and adequacy.
arXiv Detail & Related papers (2024-02-12T14:40:54Z) - MT-Ranker: Reference-free machine translation evaluation by inter-system
ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker marks state-of-the-art against reference-free as well as reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called textbftextttError Analysis Prompting (EAPrompt)
This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM) and textitproduces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using
Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It contains only a limited number of commonly occurring error types, and use a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit.
The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR.
arXiv Detail & Related papers (2021-12-27T18:47:43Z) - On the Limitations of Cross-lingual Encoders as Exposed by
Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.