Assessing the Reliability of Word Embedding Gender Bias Measures
- URL: http://arxiv.org/abs/2109.04732v1
- Date: Fri, 10 Sep 2021 08:23:50 GMT
- Title: Assessing the Reliability of Word Embedding Gender Bias Measures
- Authors: Yupei Du, Qixiang Fang, Dong Nguyen
- Abstract summary: We assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency.
Our findings inform better design of word embedding gender bias measures.
- Score: 4.258396452892244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various measures have been proposed to quantify human-like social biases in
word embeddings. However, bias scores based on these measures can suffer from
measurement error. One indication of measurement quality is reliability,
concerning the extent to which a measure produces consistent results. In this
paper, we assess three types of reliability of word embedding gender bias
measures, namely test-retest reliability, inter-rater consistency and internal
consistency. Specifically, we investigate the consistency of bias scores across
different choices of random seeds, scoring rules and words. Furthermore, we
analyse the effects of various factors on these measures' reliability scores.
Our findings inform better design of word embedding gender bias measures.
Moreover, we urge researchers to be more critical about the application of such
measures.
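
To make these reliability checks concrete, below is a minimal sketch (not the authors' code) of one of them: test-retest reliability of a simple cosine-based gender bias scoring rule (the per-word association used in WEAT-style tests), estimated by correlating per-word bias scores from two embedding models that differ only in their training random seed. The scoring rule and the attribute and target word lists are illustrative assumptions, not the paper's exact setup.

import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    # Cosine similarity between two vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def gender_bias_score(word, emb, female=("she", "woman"), male=("he", "man")):
    # One common scoring rule: mean cosine similarity to female attribute
    # words minus mean cosine similarity to male attribute words.
    w = emb[word]
    return (np.mean([cosine(w, emb[a]) for a in female])
            - np.mean([cosine(w, emb[b]) for b in male]))

def test_retest_reliability(target_words, emb_seed_1, emb_seed_2):
    # Correlate bias scores from two embedding models trained with different
    # random seeds; a high correlation indicates consistent (reliable) scores.
    scores_1 = [gender_bias_score(w, emb_seed_1) for w in target_words]
    scores_2 = [gender_bias_score(w, emb_seed_2) for w in target_words]
    r, _ = pearsonr(scores_1, scores_2)
    return r

# Toy usage with random vectors standing in for two trained embedding models.
rng = np.random.default_rng(0)
vocab = ["she", "woman", "he", "man", "nurse", "engineer", "teacher"]
emb_a = {w: rng.normal(size=50) for w in vocab}
emb_b = {w: rng.normal(size=50) for w in vocab}
print(test_retest_reliability(["nurse", "engineer", "teacher"], emb_a, emb_b))

The same pattern extends to the other reliability types discussed in the abstract, for example correlating scores produced by different scoring rules on the same embeddings (inter-rater consistency) instead of across seeds.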
Related papers
- As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research [15.722009470067974]
We investigate how measurement impacts the outcomes of bias evaluations.
We show that bias evaluations are strongly influenced by base metrics that measure performance.
Based on our findings, we recommend the use of ratio-based bias measures.
arXiv Detail & Related papers (2024-08-24T16:04:51Z)
- Semantic Properties of cosine based bias scores for word embeddings [48.0753688775574]
We propose requirements for bias scores to be considered meaningful for quantifying biases.
We analyze cosine-based scores from the literature with regard to these requirements.
We support these findings with experiments showing that the limitations of these bias scores have an impact in practical applications.
arXiv Detail & Related papers (2024-01-27T20:31:10Z)
- Trustworthy Social Bias Measurement [92.87080873893618]
In this work, we design bias measures that warrant trust based on the cross-disciplinary theory of measurement modeling.
We operationalize our definition by proposing a general bias measurement framework, DivDist, which we use to instantiate five concrete bias measures.
We demonstrate considerable evidence to trust our measures, showing they overcome conceptual, technical, and empirical deficiencies present in prior measures.
arXiv Detail & Related papers (2022-12-20T18:45:12Z)
- Choose Your Lenses: Flaws in Gender Bias Evaluation [29.16221451643288]
We assess the current paradigm of gender bias evaluation and identify several flaws in it.
First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is affected by gender.
Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions.
arXiv Detail & Related papers (2022-10-20T17:59:55Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
- The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings.
We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z)
- Information-Theoretic Bias Reduction via Causal View of Spurious Correlation [71.9123886505321]
We propose an information-theoretic bias measurement technique through a causal interpretation of spurious correlation.
We present a novel debiasing framework against the algorithmic bias, which incorporates a bias regularization loss.
The proposed bias measurement and debiasing approaches are validated in diverse realistic scenarios.
arXiv Detail & Related papers (2022-01-10T01:19:31Z)
- Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
- What do Bias Measures Measure? [41.36968251743058]
Natural Language Processing models propagate social biases about protected attributes such as gender, race, and nationality.
To create interventions and mitigate these biases and associated harms, it is vital to be able to detect and measure such biases.
This work presents a comprehensive survey of existing bias measures in NLP as a function of the associated NLP tasks, metrics, datasets, and social biases and corresponding harms.
arXiv Detail & Related papers (2021-08-07T04:08:47Z)
- On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach [3.2326259807823026]
We analyze an alternative PMI-based metric to quantify biases in texts.
It can be expressed as a function of conditional probabilities, which provides a simple interpretation in terms of word co-occurrences (see the formula sketch after this list).
arXiv Detail & Related papers (2021-04-13T19:34:17Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistically significant improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
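
For the PMI-based metric summarized above, the following is a sketch of the general form that the co-occurrence interpretation refers to (the exact definition, smoothing, and significance testing in the cited paper may differ). A difference of PMI values between two attribute contexts A and B (e.g., female- and male-associated context words) reduces to a log-ratio of conditional probabilities:

\[
\mathrm{bias}_{\mathrm{PMI}}(w; A, B)
= \mathrm{PMI}(w, A) - \mathrm{PMI}(w, B)
= \log\frac{P(w \mid A)}{P(w)} - \log\frac{P(w \mid B)}{P(w)}
= \log\frac{P(w \mid A)}{P(w \mid B)},
\]

where P(w | A) is the probability of observing word w in co-occurrence with context A, so a positive value indicates that w co-occurs more strongly with A than with B.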
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.