Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
- URL: http://arxiv.org/abs/2510.22028v1
- Date: Fri, 24 Oct 2025 21:22:06 GMT
- Title: Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
- Authors: Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
- Abstract summary: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. We reveal two critical length biases: first, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts; second, they prefer shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE-guided reinforcement learning.
- Score: 22.666172957826163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE-guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
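To make the diagnosis and mitigation strategy (a) concrete, here is a minimal Python sketch; `score_fn` and the per-token normalization are illustrative assumptions under one plausible reading of the abstract, not the paper's released code.

```python
# Hypothetical sketch: probing a QE metric for length bias on error-free
# translations, plus a simple per-token normalization of the training target.
# `score_fn` stands in for any segment-level QE metric; it is an assumption.
from statistics import correlation  # Pearson correlation, Python 3.10+

def length_bias(score_fn, sources, translations):
    """Correlate QE scores with translation length on texts known to be
    error-free; for a metric where higher means better, a clearly negative
    correlation on error-free data signals length bias."""
    lengths = [len(t.split()) for t in translations]
    scores = [score_fn(s, t) for s, t in zip(sources, translations)]
    return correlation(lengths, scores)

def length_normalized_target(raw_error_count, translation):
    """One reading of strategy (a): train the regressor on errors per token
    rather than raw error counts, so longer segments are not penalized
    merely for containing more tokens."""
    n_tokens = max(len(translation.split()), 1)
    return raw_error_count / n_tokens
```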
Related papers
- UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models [51.53270695871237]
We show that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures.
arXiv Detail & Related papers (2025-05-25T09:30:43Z)
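The abstract does not spell out the estimator, but one plausible reading of a post-hoc, length-invariant correction is to regress the uncertainty score on output length over calibration data and keep the residual; the sketch below assumes exactly that and may differ from the paper's actual method.

```python
# Minimal sketch of post-hoc length debiasing in the spirit of
# UNCERTAINTY-LINE, under the ASSUMPTION that the correction is a linear
# regression of uncertainty on output length.
import numpy as np

def debias_uncertainty(uncertainties, output_lengths):
    """Fit uncertainty ~ a * length + b on calibration data and return the
    residuals, which are uncorrelated with length by construction."""
    x = np.asarray(output_lengths, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    a, b = np.polyfit(x, u, deg=1)   # least-squares linear trend
    return u - (a * x + b)           # length-invariant residuals
```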
- Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models [10.053064215267911]
Large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. We evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task.
arXiv Detail & Related papers (2025-05-03T09:30:26Z)
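As a rough illustration of what Focus Sentence Prompting might look like in code: the model receives the full passage for context but is instructed to annotate only one focus sentence. The prompt wording below is invented, not the paper's template.

```python
# Hypothetical sketch of Focus Sentence Prompting (FSP): full-passage
# context, but MQM error-span annotation requested for a single sentence.
def focus_sentence_prompt(src_sents, tgt_sents, focus_idx):
    src_ctx = " ".join(src_sents)
    tgt_ctx = " ".join(tgt_sents)
    return (
        "You are an MQM annotator.\n"
        f"Source passage: {src_ctx}\n"
        f"Translation passage: {tgt_ctx}\n"
        "Annotate error spans ONLY in this sentence of the translation:\n"
        f"{tgt_sents[focus_idx]}\n"
        "List each span with its MQM category and severity."
    )
```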
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
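A schematic sketch of a QA-based evaluation loop in the spirit of TREQA follows; `llm` is a placeholder for any text-generation call, and none of these function names come from the paper's released code.

```python
# Schematic QA-based evaluation: generate questions from the source (or a
# reference), answer them from the candidate translation, score agreement.
def treqa_style_score(llm, source, candidate, n_questions=5):
    questions = llm(f"Write {n_questions} comprehension questions "
                    f"answerable from this text:\n{source}").splitlines()
    agree = 0
    for q in questions:
        gold = llm(f"Answer from the text.\nText: {source}\nQ: {q}")
        pred = llm(f"Answer from the text.\nText: {candidate}\nQ: {q}")
        verdict = llm("Do these answers match? Reply yes/no.\n"
                      f"A1: {gold}\nA2: {pred}")
        agree += verdict.strip().lower().startswith("yes")
    return agree / max(len(questions), 1)
```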
- Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation [28.01631390361754]
This paper defines and investigates gender bias in QE metrics. We show that masculine-inflected translations score higher than feminine-inflected ones, and that gender-neutral translations are penalized. Our findings underscore the need for a renewed focus on developing and evaluating QE metrics centered on gender.
arXiv Detail & Related papers (2024-10-14T18:24:52Z)
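A minimal sketch of the kind of contrastive probe this line of work suggests: score masculine- and feminine-inflected variants of the same correct translation with a QE metric and measure the gap. The harness below is an assumption, not the paper's code; `qe` is a placeholder for any reference-free QE metric.

```python
# Contrastive gender probe: both variants are assumed correct translations,
# so an unbiased metric should show a mean score gap near zero.
def gender_score_gap(qe, source, masc_translation, fem_translation):
    return qe(source, masc_translation) - qe(source, fem_translation)

def mean_gap(qe, triples):
    """triples: iterable of (source, masculine_variant, feminine_variant)."""
    gaps = [gender_score_gap(qe, s, m, f) for s, m, f in triples]
    return sum(gaps) / len(gaps)
```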
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking, on ACES, 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
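Contrastive challenge sets like ACES are typically scored by checking whether a metric prefers the good translation in each pair; a generic harness might look like the following (field names are illustrative, not the released ACES schema).

```python
# Generic contrastive challenge-set scoring: each example pairs a good
# translation with an incorrect one containing a specific accuracy error,
# and a metric passes if it scores the good one higher.
def challenge_accuracy(metric, examples):
    """examples: iterable of (source, good_translation, bad_translation)."""
    correct = 0
    total = 0
    for src, good, bad in examples:
        correct += metric(src, good) > metric(src, bad)
        total += 1
    return correct / max(total, 1)
```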
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
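One way to read the rule-plus-penalty scheme in code, with made-up heuristics standing in for the paper's actual rules:

```python
# Sketch of a penalized QE reward: the QE score is the base reward, and
# translations caught by simple error rules (empty, implausibly short, or
# highly repetitive output; the paper's own rules may differ) receive a
# fixed penalty to curb reward overoptimization.
def penalized_reward(qe_score, source, translation, penalty=5.0):
    def looks_incorrect(src, hyp):
        toks = hyp.split()
        if not toks:
            return True                          # empty output
        if len(toks) < 0.3 * len(src.split()):
            return True                          # implausibly short
        if len(set(toks)) < 0.3 * len(toks):
            return True                          # degenerate repetition
        return False

    reward = qe_score(source, translation)
    if looks_incorrect(source, translation):
        reward -= penalty
    return reward
```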
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendencies of the metric paradigm itself.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
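For background, Minimum Risk Training optimizes the expected metric cost over sampled candidate translations; a textbook-style sketch of that objective (not the paper's implementation) follows. A universal adversarial translation is one this objective rewards for any source sentence despite being a poor translation.

```python
# Textbook-style Minimum Risk Training (MRT) objective: expected negative
# metric score under the model's renormalized (alpha-sharpened) candidate
# distribution. Higher metric score = better translation.
import math

def mrt_risk(log_probs, metric_scores, alpha=1.0):
    """log_probs: model log-likelihoods of sampled hypotheses;
    metric_scores: per-candidate metric values for the same hypotheses."""
    weights = [math.exp(alpha * lp) for lp in log_probs]
    z = sum(weights)
    return sum(w / z * (-s) for w, s in zip(weights, metric_scores))
```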
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, *HJQE*, where expert translators directly annotate poorly translated words.
We propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to *HJQE*.
The results show that our proposed dataset is more consistent with human judgement and confirm the effectiveness of the proposed tag-correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
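For context on the "TER-based artificial QE corpus" the paper critiques, word-level OK/BAD tags are conventionally derived by aligning the MT output with its post-edited version and marking unmatched MT tokens as BAD. A rough sketch, with `difflib` standing in for a real TER aligner:

```python
# Background sketch: derive OK/BAD word-level QE tags from an edit
# alignment between the MT output and its post-edit. difflib is only a
# stand-in for a proper TER alignment tool.
import difflib

def ter_style_tags(mt_tokens, pe_tokens):
    """Return one OK/BAD tag per MT token based on an edit alignment."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"   # token survives post-editing unchanged
    return tags
```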