Related papers: Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation

URL: http://arxiv.org/abs/2410.10995v1
Date: Mon, 14 Oct 2024 18:24:52 GMT
Title: Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
Authors: Emmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal, André F. T. Martins
Abstract summary: This paper is the first to investigate gender bias in quality estimation (QE) metrics and its downstream impact on machine translation (MT). Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. We show that QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding.
Score: 28.01631390361754
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The automatic assessment of translation quality has recently become crucial for many stages of the translation pipeline, from data curation to training and decoding. However, while quality estimation metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic groups over others. This paper is the first to investigate gender bias in quality estimation (QE) metrics and its downstream impact on machine translation (MT). We focus on out-of-English translations where the target language uses grammatical gender. We ask: (RQ1) Do contemporary QE metrics exhibit gender bias? (RQ2) Can the use of contextual information mitigate this bias? (RQ3) How does QE influence gender bias in MT outputs? Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Moreover, context-aware QE metrics reduce errors for masculine-inflected references but fail to address feminine referents, exacerbating gender disparities. Additionally, we show that QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding. Our findings highlight the need to address gender bias in QE metrics to ensure equitable and unbiased MT systems.
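As a rough illustration of the kind of comparison RQ1 implies, the sketch below computes the mean score gap between masculine- and feminine-inflected translations of the same source sentences. It is not the authors' code: `gender_score_gap`, the `Scorer` type, and `toy_scorer` are hypothetical names, and the toy scorer is a deliberately biased stand-in for a real QE metric.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical scorer type: (source, translation) -> quality score.
# A real QE metric would be plugged in here; none is assumed.
Scorer = Callable[[str, str], float]


def gender_score_gap(
    scorer: Scorer,
    sources: Sequence[str],
    masculine: Sequence[str],
    feminine: Sequence[str],
) -> float:
    """Mean of score(masculine) - score(feminine) over paired translations.

    A positive value means the scorer favors masculine-inflected forms.
    """
    if not (len(sources) == len(masculine) == len(feminine)):
        raise ValueError("sources, masculine and feminine must be aligned")
    if not sources:
        raise ValueError("at least one pair is required")
    gaps = [
        scorer(src, masc) - scorer(src, fem)
        for src, masc, fem in zip(sources, masculine, feminine)
    ]
    return mean(gaps)


def toy_scorer(source: str, translation: str) -> float:
    # Toy stand-in, biased on purpose: rewards translations ending in "o".
    return 1.0 if translation.endswith("o") else 0.5


gap = gender_score_gap(
    toy_scorer,
    ["The doctor is tired."],
    ["Il dottore è stanco"],
    ["La dottoressa è stanca"],
)
print(gap)
```

With a real metric, the same paired setup would be run over a contrastive dataset, and a gap consistently above zero would point to a preference for masculine forms.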

Related papers

GG-BBQ: German Gender Bias Benchmark for Question Answering [1.4545246152596758]
We evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation.
arXiv Detail & Related papers (2025-07-22T10:02:28Z)
Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement [6.92803536773427]
Social biases in Natural Language Processing (NLP) and Information Retrieval (IR) systems are an ongoing challenge. We aim to address this issue by leveraging Large Language Models (LLMs) to detect and measure gender bias in passage ranking. We introduce a novel gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to address existing limitations.
arXiv Detail & Related papers (2025-06-27T16:39:12Z)
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation [4.881426374773398]
We propose a novel evaluation metric called Minimal Pair Accuracy (MPA). MPA focuses on whether models adapt to gender cues in minimal pairs. MPA shows that in anti-stereotypical cases, NMT models tend to more consistently take masculine gender cues into account.
arXiv Detail & Related papers (2025-05-13T13:17:23Z)
Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases. GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z)
Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders. This study presents a benchmark, AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words). We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z)
Can Automatic Metrics Assess High-Quality Translations? [28.407966066693334]
We show that current metrics are insensitive to nuanced differences in translation quality. This effect is most pronounced when the quality is high and the variance among alternatives is low. Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans.
arXiv Detail & Related papers (2024-05-28T16:44:02Z)
Whose wife is it anyway? Assessing bias against same-gender relationships in machine translation [26.676686759877597]
Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. We investigate the degree of bias against same-gender relationships in MT systems. We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between entities of the same gender.
arXiv Detail & Related papers (2024-01-10T07:33:32Z)
Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation [0.0]
We use Hindi as the source language and construct two sets of gender-specific sentences to evaluate different Hindi-English (HI-EN) NMT systems. Our work highlights the importance of considering the nature of language when designing such extrinsic bias evaluation datasets.
arXiv Detail & Related papers (2023-11-07T07:09:59Z)
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation [35.44115368160656]
We investigate whether and to what extent machine translation models exhibit gender bias. We find that instruction-fine-tuned (IFT) models default to male-inflected translations, even disregarding female occupational stereotypes. We propose an easy-to-implement and effective bias mitigation solution.
arXiv Detail & Related papers (2023-10-18T17:36:55Z)
The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages [51.2321117760104]
This paper describes the Gender-GAP Pipeline, an automatic pipeline to characterize gender representation in large-scale datasets for 55 languages. The pipeline uses a multilingual lexicon of gendered person-nouns to quantify the gender representation in text. We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation.
arXiv Detail & Related papers (2023-08-31T17:20:50Z)
BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation [18.074541317458817]
We introduce MT-GenEval, a benchmark for evaluating gender accuracy in translation from English into eight languages. Our data and code are publicly available under a CC BY SA 3.0 license.
arXiv Detail & Related papers (2022-11-02T17:55:43Z)
Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics. We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
Mitigating Gender Bias in Machine Translation through Adversarial Learning [0.8883733362171032]
We present an adversarial learning framework that addresses challenges to mitigate gender bias in seq2seq machine translation. Our framework improves the disparity in translation quality for sentences with male vs. female entities by 86% for English-German translation and 91% for English-French translation.
arXiv Detail & Related papers (2022-03-20T23:35:09Z)
Improving Gender Translation Accuracy with Filtered Self-Training [14.938401898546548]
Machine translation systems often output incorrect gender, even when the gender is clear from context. We propose a gender-filtered self-training technique to improve gender translation accuracy on unambiguously gendered inputs.
arXiv Detail & Related papers (2021-04-15T18:05:29Z)
Multi-Dimensional Gender Bias Classification [67.65551687580552]
Machine learning models can inadvertently learn socially undesirable patterns when training on gender biased text. We propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information.
arXiv Detail & Related papers (2020-05-01T21:23:20Z)
Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem [21.44025591721678]
Training data for NLP tasks often exhibits gender bias in that fewer sentences refer to women than to men. The recent WinoMT challenge set allows us to measure this effect directly. We use transfer learning on a small set of trusted, gender-balanced examples.
arXiv Detail & Related papers (2020-04-09T11:55:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.