Related papers: Choose Your Lenses: Flaws in Gender Bias Evaluation

Choose Your Lenses: Flaws in Gender Bias Evaluation

URL: http://arxiv.org/abs/2210.11471v1
Date: Thu, 20 Oct 2022 17:59:55 GMT
Title: Choose Your Lenses: Flaws in Gender Bias Evaluation
Authors: Hadas Orgad and Yonatan Belinkov
Abstract summary: We assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender. Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions.
Score: 29.16221451643288
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used in this vein. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender, as opposed to intrinsic evaluations of model representations, which are less strongly connected to specific harms to people interacting with systems. We find that only a few extrinsic metrics are measured in most studies, although more can be measured. Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions, and how one may decouple them. We then investigate how the choice of the dataset and its composition, as well as the choice of the metric, affect bias measurement, finding significant variations across each of them. Finally, we propose several guidelines for more reliable gender bias evaluation.

Related papers

Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement [6.92803536773427]
Social biases in Natural Language Processing (NLP) and Information Retrieval (IR) systems are an ongoing challenge.<n>We aim to address this issue by leveraging Large Language Models (LLMs) to detect and measure gender bias in passage ranking.<n>We introduce a novel gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to address existing limitations.
arXiv Detail & Related papers (2025-06-27T16:39:12Z)
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition [49.27067541740956]
EMO-Debias is a large-scale comparison of 13 debiasing methods applied to multi-label SER.<n>Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization.<n>Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps.
arXiv Detail & Related papers (2025-06-05T05:48:31Z)
BiasConnect: Investigating Bias Interactions in Text-to-Image Models [73.76853483463836]
We introduce BiasConnect, a novel tool designed to analyze and quantify bias interactions in Text-to-Image models. Our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified. We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.
arXiv Detail & Related papers (2025-03-12T19:01:41Z)
Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets [17.101242741559428]
This paper focuses on intrinsic bias mitigation and measurement strategies for language models. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotype. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
arXiv Detail & Related papers (2025-01-02T09:40:31Z)
Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models. We focus on and apply it to the operational evaluation of face recognition systems.
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research [15.722009470067974]
We investigate how measurement impacts the outcomes of bias evaluations. We show that bias evaluations are strongly influenced by base metrics that measure performance. Based on our findings, we recommend the use of ratio-based bias measures.
arXiv Detail & Related papers (2024-08-24T16:04:51Z)
The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated [70.23064111640132]
We compare the impact of debiasing on performance across multiple downstream tasks using a wide-range of benchmark datasets. Experiments show that the effects of debiasing are consistently emphunderestimated across all tasks.
arXiv Detail & Related papers (2023-09-16T20:25:34Z)
Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks. We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations. We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
Counter-GAP: Counterfactual Bias Evaluation through Gendered Ambiguous Pronouns [53.62845317039185]
Bias-measuring datasets play a critical role in detecting biased behavior of language models. We propose a novel method to collect diverse, natural, and minimally distant text pairs via counterfactual generation. We show that four pre-trained language models are significantly more inconsistent across different gender groups than within each group.
arXiv Detail & Related papers (2023-02-11T12:11:03Z)
MABEL: Attenuating Gender Bias using Textual Entailment Data [20.489427903240017]
We propose MABEL, an intermediate pre-training approach for mitigating gender bias in contextualized representations. Key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs. We show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness.
arXiv Detail & Related papers (2022-10-26T18:36:58Z)
Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics. We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases. A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network. For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics. We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
Intrinsic Bias Metrics Do Not Correlate with Application Bias [12.588713044749179]
This research examines whether easy-to-measure intrinsic metrics correlate well to real world extrinsic metrics. We measure both intrinsic and extrinsic bias across hundreds of trained models covering different tasks and experimental conditions. We advise that efforts to debias embedding spaces be always also paired with measurement of downstream model bias, and suggest that that community increase effort into making downstream measurement more feasible via creation of additional challenge sets and annotated test data.
arXiv Detail & Related papers (2020-12-31T18:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.