On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach
- URL: http://arxiv.org/abs/2104.06474v2
- Date: Tue, 18 Jul 2023 16:40:41 GMT
- Title: On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach
- Authors: Francisco Valentini, Germán Rosati, Damián Blasi, Diego Fernandez Slezak, and Edgar Altszyler
- Abstract summary: We analyze an alternative PMI-based metric to quantify biases in texts.
It can be expressed as a function of conditional probabilities, which provides a simple interpretation in terms of word co-occurrences.
- Score: 3.2326259807823026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, word embeddings have been widely used to measure biases in
texts. Although they have proven effective in detecting a wide variety of
biases, metrics based on word embeddings lack transparency and
interpretability. We analyze an alternative PMI-based metric to quantify biases
in texts. It can be expressed as a function of conditional probabilities, which
provides a simple interpretation in terms of word co-occurrences. We also prove
that it can be approximated by an odds ratio, which allows estimating
confidence intervals and statistical significance of textual biases. This
approach produces similar results to metrics based on word embeddings when
capturing gender gaps of the real world embedded in large corpora.
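As a rough illustration of how such a PMI-based bias score and its odds-ratio approximation might be computed, the Python sketch below works from a 2x2 table of co-occurrence counts between a target word and two groups of context words. The function name pmi_bias_with_ci, the example counts, and the normal-approximation 95% interval are illustrative assumptions, not the paper's exact estimator.

```python
import math

def pmi_bias_with_ci(n_wA, n_wB, n_notwA, n_notwB, z=1.96):
    """Sketch of a PMI-style bias score for a target word w with respect to
    two context groups A and B, estimated from co-occurrence counts.

    n_wA    -- co-occurrences of w with context words in group A
    n_wB    -- co-occurrences of w with context words in group B
    n_notwA -- co-occurrences of all other words with group A
    n_notwB -- co-occurrences of all other words with group B
    (All counts are assumed to be positive.)
    """
    # PMI difference written as a ratio of conditional probabilities:
    # bias = log P(w|A) - log P(w|B)
    p_w_given_A = n_wA / (n_wA + n_notwA)
    p_w_given_B = n_wB / (n_wB + n_notwB)
    bias = math.log(p_w_given_A / p_w_given_B)

    # When w is rare relative to the totals, this is close to the log odds
    # ratio of the 2x2 table, whose standard error has a closed form,
    # giving an approximate confidence interval for the textual bias.
    log_or = math.log((n_wA * n_notwB) / (n_wB * n_notwA))
    se = math.sqrt(1 / n_wA + 1 / n_wB + 1 / n_notwA + 1 / n_notwB)
    ci = (log_or - z * se, log_or + z * se)
    return bias, log_or, ci

# Illustrative counts: a word that co-occurs more often with group A contexts
bias, log_or, (lo, hi) = pmi_bias_with_ci(n_wA=120, n_wB=60,
                                          n_notwA=50_000, n_notwB=48_000)
print(f"PMI bias: {bias:.3f}, log odds ratio: {log_or:.3f}, "
      f"95% CI: ({lo:.3f}, {hi:.3f})")
```

Because the bias is a log-ratio of conditional probabilities, a value near zero indicates no association with either group, and an interval that excludes zero suggests a statistically significant textual bias under this approximation.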
Related papers
- Analyzing Correlations Between Intrinsic and Extrinsic Bias Metrics of Static Word Embeddings With Their Measuring Biases Aligned [8.673018064714547]
We examine the abilities of intrinsic bias metrics of static word embeddings to predict whether Natural Language Processing (NLP) systems exhibit biased behavior.
A word embedding is one of the fundamental NLP technologies that represents the meanings of words through real vectors, and problematically, it also learns social biases such as stereotypes.
arXiv Detail & Related papers (2024-09-14T02:13:56Z)
- Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation [0.9576327614980397]
This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations.
As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them.
The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts.
arXiv Detail & Related papers (2024-06-14T14:47:19Z)
- COBIAS: Contextual Reliability in Bias Assessment [14.594920595573038]
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices.
Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets.
We introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear.
arXiv Detail & Related papers (2024-02-22T10:46:11Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper addresses the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting Neighboring Distribution Divergence (NDD) metric to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
- Assessing the Reliability of Word Embedding Gender Bias Measures [4.258396452892244]
We assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency.
Our findings inform better design of word embedding gender bias measures.
arXiv Detail & Related papers (2021-09-10T08:23:50Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation [86.11292297348622]
We show that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
We propose CR/NRR as a substitute for the quality/diversity metric pair.
arXiv Detail & Related papers (2020-07-03T04:06:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.