Choose Your Lenses: Flaws in Gender Bias Evaluation
- URL: http://arxiv.org/abs/2210.11471v1
- Date: Thu, 20 Oct 2022 17:59:55 GMT
- Title: Choose Your Lenses: Flaws in Gender Bias Evaluation
- Authors: Hadas Orgad and Yonatan Belinkov
- Abstract summary: We assess the current paradigm of gender bias evaluation and identify several flaws in it.
First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender.
Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions.
- Score: 29.16221451643288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considerable efforts to measure and mitigate gender bias in recent years have
led to the introduction of an abundance of tasks, datasets, and metrics used in
this vein. In this position paper, we assess the current paradigm of gender
bias evaluation and identify several flaws in it. First, we highlight the
importance of extrinsic bias metrics that measure how a model's performance on
some task is affected by gender, as opposed to intrinsic evaluations of model
representations, which are less strongly connected to specific harms to people
interacting with systems. We find that only a few extrinsic metrics are
measured in most studies, although more can be measured. Second, we find that
datasets and metrics are often coupled, and discuss how their coupling hinders
the ability to obtain reliable conclusions, and how one may decouple them. We
then investigate how the choice of the dataset and its composition, as well as
the choice of the metric, affect bias measurement, finding significant
variations across each of them. Finally, we propose several guidelines for more
reliable gender bias evaluation.
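As a rough illustration of what an extrinsic metric looks like in practice, the sketch below computes per-group performance on a task and reports the gap between groups. It is not the authors' code or evaluation protocol: the helper names (`group_rates`, `gap`), the group labels "F" and "M", and the toy labels and predictions are all assumptions invented for this example.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return per-group accuracy and true positive rate (TPR)."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "pos": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        s = stats[g]
        s["n"] += 1
        s["correct"] += int(t == p)
        if t == 1:  # only gold-positive examples count toward TPR
            s["pos"] += 1
            s["tp"] += int(p == 1)
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
        for g, s in stats.items()
    }


def gap(rates, metric, group_a="F", group_b="M"):
    """Signed difference in a performance metric between two groups."""
    return rates[group_a][metric] - rates[group_b][metric]


# Hypothetical toy data: binary task labels, model predictions, gender groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

rates = group_rates(y_true, y_pred, groups)
accuracy_gap = gap(rates, "accuracy")
tpr_gap = gap(rates, "tpr")
print(f"accuracy gap: {accuracy_gap:+.2f}, TPR gap: {tpr_gap:+.2f}")
```

On this toy data both groups have the same accuracy, yet their true positive rates differ, so the two metrics give different pictures of the same predictions. This is a small-scale instance of the abstract's point that the choice of metric affects the bias that is measured.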
Related papers
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models.
We focus on and apply it to the operational evaluation of face recognition systems.
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
- As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research [15.722009470067974]
We investigate how measurement impacts the outcomes of bias evaluations.
We show that bias evaluations are strongly influenced by base metrics that measure performance.
Based on our findings, we recommend the use of ratio-based bias measures.
arXiv Detail & Related papers (2024-08-24T16:04:51Z)
- The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated [70.23064111640132]
We compare the impact of debiasing on performance across multiple downstream tasks using a wide range of benchmark datasets.
Experiments show that the effects of debiasing are consistently underestimated across all tasks.
arXiv Detail & Related papers (2023-09-16T20:25:34Z)
- Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
- Counter-GAP: Counterfactual Bias Evaluation through Gendered Ambiguous Pronouns [53.62845317039185]
Bias-measuring datasets play a critical role in detecting biased behavior of language models.
We propose a novel method to collect diverse, natural, and minimally distant text pairs via counterfactual generation.
We show that four pre-trained language models are significantly more inconsistent across different gender groups than within each group.
arXiv Detail & Related papers (2023-02-11T12:11:03Z)
- MABEL: Attenuating Gender Bias using Textual Entailment Data [20.489427903240017]
We propose MABEL, an intermediate pre-training approach for mitigating gender bias in contextualized representations.
Key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs.
We show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness.
arXiv Detail & Related papers (2022-10-26T18:36:58Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
- Intrinsic Bias Metrics Do Not Correlate with Application Bias [12.588713044749179]
This research examines whether easy-to-measure intrinsic metrics correlate well with real-world extrinsic metrics.
We measure both intrinsic and extrinsic bias across hundreds of trained models covering different tasks and experimental conditions.
We advise that efforts to debias embedding spaces always be paired with measurement of downstream model bias, and suggest that the community increase effort into making downstream measurement more feasible via creation of additional challenge sets and annotated test data.
arXiv Detail & Related papers (2020-12-31T18:59:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.