FairBelief - Assessing Harmful Beliefs in Language Models
- URL: http://arxiv.org/abs/2402.17389v1
- Date: Tue, 27 Feb 2024 10:31:00 GMT
- Title: FairBelief - Assessing Harmful Beliefs in Language Models
- Authors: Mattia Setzu, Marta Marchiori Manerba, Pasquale Minervini, Debora Nozza
- Abstract summary: Language Models (LMs) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing.
This paper proposes FairBelief, an analytical approach to capture and assess beliefs, i.e., propositions that an LM may embed with different degrees of confidence and that covertly influence its predictions.
- Score: 25.032952666134157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language Models (LMs) have been shown to inherit undesired biases that might
hurt minorities and underrepresented groups if such systems were integrated
into real-world applications without careful fairness auditing. This paper
proposes FairBelief, an analytical approach to capture and assess beliefs,
i.e., propositions that an LM may embed with different degrees of confidence
and that covertly influence its predictions. With FairBelief, we leverage
prompting to study the behavior of several state-of-the-art LMs across
different previously neglected axes, such as model scale and likelihood,
assessing predictions on a fairness dataset specifically designed to quantify
LMs' outputs' hurtfulness. Finally, we conclude with an in-depth qualitative
assessment of the beliefs emitted by the models. We apply FairBelief to English
LMs, revealing that, although these architectures achieve high performance on
diverse natural language processing tasks, they show hurtful beliefs about
specific genders. Interestingly, training procedure and dataset, model scale,
and architecture induce beliefs of different degrees of hurtfulness.
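The core probing step can be pictured as a cloze test: the LM fills a masked slot in a prompt template, and the probability it assigns to each completion is read as the confidence of the corresponding belief. The sketch below illustrates this setup under assumptions that are not taken from the paper: the HuggingFace fill-mask pipeline, bert-base-uncased as a stand-in masked LM, and a toy hurtful-term lexicon in place of the dedicated fairness dataset the authors use.

```python
# Minimal sketch of cloze-style belief probing with a masked LM.
# The templates and lexicon are illustrative placeholders, not the
# fairness dataset used in the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # stand-in masked LM

templates = [
    "The woman worked as a [MASK].",
    "The man worked as a [MASK].",
]
hurtful_terms = {"slave", "prostitute", "criminal"}  # toy lexicon

for template in templates:
    completions = fill_mask(template, top_k=10)
    # Each completion's softmax score is read as the confidence of the belief.
    hurtful_mass = sum(
        c["score"] for c in completions
        if c["token_str"].strip().lower() in hurtful_terms
    )
    print(f"{template!r}: hurtful probability mass in top 10 = {hurtful_mass:.4f}")
```

In the paper's setting, the prompts target specific demographic groups and hurtfulness is quantified with a fairness dataset designed for that purpose rather than a hand-written term list.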
Related papers
- Fairness Evaluation with Item Response Theory [10.871079276188649]
This paper proposes a novel Fair-IRT framework to evaluate fairness in Machine Learning (ML) models.
Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals.
Experiments demonstrate the effectiveness of this framework as a fairness evaluation tool.
arXiv Detail & Related papers (2024-10-20T22:25:20Z) - Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion [9.383571944693188]
We study four different prediction scenarios for which the LM can be expected to show distinct behaviors.
We propose a model-specific recipe called PrISM for constructing datasets with examples of each scenario.
We find that while causal tracing (CT) produces different results for each scenario, aggregations over a set of mixed examples may only reflect the results from the scenario with the strongest measured signal.
arXiv Detail & Related papers (2024-10-18T12:08:07Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers [13.644277507363036]
We investigate whether LLM capabilities are measurable outside of tailored prompting and multiple-choice questions (MCQ).
Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer.
As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture.
arXiv Detail & Related papers (2024-06-21T08:56:35Z) - Social Bias Probing: Fairness Benchmarking for Language Models [38.180696489079985]
This paper proposes a novel framework for probing language models for social biases by assessing disparate treatment.
We curate SoFa, a large-scale benchmark designed to address the limitations of existing fairness collections.
We show that biases within language models are more nuanced than acknowledged, indicating a broader scope of encoded biases than previously recognized.
arXiv Detail & Related papers (2023-11-15T16:35:59Z) - All Should Be Equal in the Eyes of Language Models: Counterfactually Aware Fair Text Generation [16.016546693767403]
We propose CAFIE, a framework that dynamically compares the model's understanding of diverse demographics to generate more equitable sentences.
CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability.
arXiv Detail & Related papers (2023-11-09T15:39:40Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against subgroups described by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features; a generic illustration of this idea appears in the sketch after this list.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and with individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z)
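As a companion to the prediction-sensitivity entry above, the sketch below illustrates the general idea with a perturbation check: swap a protected-attribute token in the input and measure how much the classifier's output distribution moves. This is a hedged, generic sketch, not the paper's accumulated prediction sensitivity metric; the model name and swap list are illustrative placeholders.

```python
# Generic perturbation-based prediction sensitivity for a text classifier.
# Illustration only: the model and swap list below are placeholders, and this
# is not the exact accumulated prediction sensitivity metric from the paper.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in classifier
    return_all_scores=True,
)

def prediction_sensitivity(text, swaps):
    """Mean absolute change in class probabilities under protected-attribute swaps."""
    base = {s["label"]: s["score"] for s in classifier([text])[0]}
    deltas = []
    for old, new in swaps:
        perturbed = {s["label"]: s["score"] for s in classifier([text.replace(old, new)])[0]}
        deltas.append(sum(abs(base[label] - perturbed[label]) for label in base) / len(base))
    return sum(deltas) / len(deltas)

print(prediction_sensitivity("She is a brilliant engineer.", [("She", "He")]))
```

A value near zero means the prediction barely moves under the swap; larger values flag inputs whose predictions depend on the protected attribute.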