What Changed? Investigating Debiasing Methods using Causal Mediation Analysis
- URL: http://arxiv.org/abs/2206.00701v1
- Date: Wed, 1 Jun 2022 18:26:24 GMT
- Title: What Changed? Investigating Debiasing Methods using Causal Mediation Analysis
- Authors: Sullam Jeoung, Jana Diesner
- Abstract summary: We decompose the internal mechanisms of debiasing language models with respect to gender.
Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics.
- Score: 1.3225884668783203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work has examined how debiasing language models affects downstream
tasks, specifically, how debiasing techniques influence task performance and
whether debiased models also make impartial predictions in downstream tasks.
However, it is not yet well understood why debiasing methods have varying
impacts on downstream tasks, or how debiasing techniques affect the internal
components of language models, i.e., neurons, layers, and attention heads.
In this paper, we decompose the internal mechanisms of debiasing language
models with respect to gender by applying causal mediation analysis to
understand the influence of debiasing methods on toxicity detection as a
downstream task. Our findings suggest a need to test the effectiveness of
debiasing methods with different bias metrics, and to focus on changes in the
behavior of certain components of the models, e.g., the first two layers of
language models and the attention heads.
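To make the method concrete: causal mediation analysis treats a model component as a mediator and asks how much of a bias effect flows through it. The sketch below estimates a layer's indirect effect on gender bias in GPT-2 by patching its activations from a gender-swapped prompt. The prompts, the she/he log-odds measure, and all names are illustrative assumptions, not the paper's code.

```python
# Minimal activation-patching sketch of causal mediation analysis (assumed
# setup, not the paper's code): how much of the gender-swap effect on the
# she/he log-odds is mediated by a single transformer layer?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
SHE, HE = tok.encode(" she")[0], tok.encode(" he")[0]

def she_he_odds(prompt, patch=None, layer=None):
    """Log-odds of ' she' vs ' he' as the next token; optionally replace one
    layer's hidden states with activations saved from another run."""
    handle = None
    if patch is not None:
        handle = model.transformer.h[layer].register_forward_hook(
            lambda m, inp, out: (patch,) + out[1:])
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    return (logits[SHE] - logits[HE]).item()

def saved_hidden(prompt, layer):
    """One forward pass, saving the hidden states a given layer emits."""
    store = {}
    handle = model.transformer.h[layer].register_forward_hook(
        lambda m, inp, out: store.update(h=out[0]))
    with torch.no_grad():
        model(tok(prompt, return_tensors="pt").input_ids)
    handle.remove()
    return store["h"]

base = "The nurse said that"   # profession prompt (illustrative)
swap = "The man said that"     # gender-swapped counterfactual (same length)
total = she_he_odds(swap) - she_he_odds(base)
for layer in (0, 1):           # the paper highlights the first two layers
    indirect = she_he_odds(base, patch=saved_hidden(swap, layer),
                           layer=layer) - she_he_odds(base)
    print(f"layer {layer}: total {total:+.3f}, indirect {indirect:+.3f}")
```

A near-zero indirect effect for a layer means debiasing that layer should change little; large values at the first two layers would match the paper's finding.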
Related papers
- Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation [19.06428714669272]
We systematically test how methods for intrinsic debiasing affect neural machine translation models.
We highlight three challenges and mismatches between the debiasing techniques and their end-goal usage.
arXiv Detail & Related papers (2024-06-02T15:57:29Z)
- Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings.
CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method.
Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z)
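The bias-neuron recipe summarized in the entry above boils down to two steps: score each FFN neuron's contribution to a biased output, then zero the top scorers. The sketch below uses an activation-times-gradient score on GPT-2; the model, prompt, scoring rule, and the choice of 32 neurons are assumptions standing in for CRISPR's actual explainability method.

```python
# Bias-neuron sketch in the spirit of CRISPR (assumed setup, scoring rule,
# and k; not the paper's exact method): attribute a biased logit gap to FFN
# neurons via activation x gradient, then zero the top-scoring neurons.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

acts = {}
def save(layer):
    def hook(module, inputs, output):
        output.retain_grad()      # keep this activation's gradient
        acts[layer] = output      # FFN pre-activations, shape (1, seq, 3072)
    return hook

handles = [blk.mlp.c_fc.register_forward_hook(save(i))
           for i, blk in enumerate(model.transformer.h)]

ids = tok("The doctor said that", return_tensors="pt").input_ids
logits = model(ids).logits[0, -1]
# Scalar bias score: preference for " he" over " she" as the next token.
gap = logits[tok.encode(" he")[0]] - logits[tok.encode(" she")[0]]
gap.backward()
for h in handles:
    h.remove()

# Attribution per neuron: |activation * gradient|, summed over positions.
scores = torch.stack([(acts[i] * acts[i].grad).abs().sum(dim=(0, 1))
                      for i in range(len(acts))])     # (n_layers, 3072)
top = torch.topk(scores.flatten(), k=32).indices      # 32 "bias neurons"
with torch.no_grad():
    for idx in top:
        layer, neuron = divmod(idx.item(), scores.size(1))
        mlp = model.transformer.h[layer].mlp
        mlp.c_fc.weight[:, neuron] = 0.0  # kill the neuron's input weights...
        mlp.c_fc.bias[neuron] = 0.0       # ...and its bias: it now outputs 0
```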
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes that contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to these confounding and mediating attributes and to model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model [1.6343144783668118]
Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias.
We study three methods for identifying causal relations between LM components and particular outputs.
We apply the methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation.
arXiv Detail & Related papers (2023-10-19T09:39:21Z)
- The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated [70.23064111640132]
We compare the impact of debiasing on performance across multiple downstream tasks using a wide range of benchmark datasets.
Experiments show that the effects of debiasing are consistently underestimated across all tasks.
arXiv Detail & Related papers (2023-09-16T20:25:34Z)
- Data augmentation and explainability for bias discovery and mitigation in deep learning [0.0]
This dissertation explores the impact of bias in deep neural networks and presents methods for reducing its influence on model performance.
The first part begins by categorizing and describing potential sources of bias and errors in data and models, with a particular focus on bias in machine learning pipelines.
The next chapter outlines a taxonomy and methods of Explainable AI as a way to justify predictions and control and improve the model.
arXiv Detail & Related papers (2023-08-18T11:02:27Z)
- An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-Trained Language Models [4.937002982255573]
Recent work has shown that pre-trained language models capture social biases from the text corpora they are trained on.
The survey covers five recently proposed debiasing techniques: Counterfactual Data Augmentation, Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias (the last of these is sketched after this entry).
We quantify the effectiveness of each technique using three different bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability.
arXiv Detail & Related papers (2021-10-16T09:40:30Z)
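Of the five surveyed techniques, SentenceDebias has the most compact core: estimate a bias direction from counterfactual sentence pairs and subtract each embedding's projection onto it. Below is a single-direction sketch (the actual method can use a multi-dimensional bias subspace); inputs are assumed to be precomputed sentence embeddings.

```python
# Single-direction SentenceDebias sketch: the bias direction is the first
# principal component of difference vectors between gender-counterfactual
# sentence pairs; debiasing removes each embedding's projection onto it.
import numpy as np

def sentence_debias(embeddings, pairs_a, pairs_b):
    """embeddings: (n, d) sentence embeddings to debias.
    pairs_a, pairs_b: (m, d) embeddings of sentence pairs differing only in
    gender words, e.g. "he is a doctor" vs. "she is a doctor"."""
    diffs = pairs_a - pairs_b
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    v = vt[0]                                        # unit-norm bias direction
    return embeddings - np.outer(embeddings @ v, v)  # subtract its projection
```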
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author, such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
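The reweighting idea above can be made concrete with a simple inverse-frequency scheme: weight each training instance so the weighted joint distribution of author demographic and label is flat, removing the incentive to use demographic shortcuts. This is an assumed minimal variant, not necessarily the paper's exact scheme.

```python
# Minimal instance-reweighting sketch (assumed variant): weight examples by
# the inverse frequency of their (author demographic, label) pair so the
# weighted joint distribution is flat.
import numpy as np

def balancing_weights(demographics, labels):
    """Return one weight per instance, normalized to mean 1, such that every
    observed (demographic, label) cell carries equal total weight."""
    pairs = list(zip(demographics, labels))
    counts = {p: pairs.count(p) for p in set(pairs)}
    w = np.array([1.0 / counts[p] for p in pairs])
    return w * len(w) / w.sum()

# Toy usage: the over-represented ("f", 1) cell gets down-weighted; in
# training, each example's loss term is multiplied by its weight.
print(balancing_weights(["f", "f", "f", "m"], [1, 1, 0, 0]))
```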
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
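The method in the entry above combines the main model with a weak, limited-capacity model in a product of experts: the weak model soaks up dataset shortcuts, so the gradient reaching the main model comes mostly from examples the shortcuts get wrong. A minimal sketch of that loss (function and variable names assumed):

```python
# Product-of-experts sketch: the weak model's (detached) log-probabilities are
# added to the main model's, so confidently "shortcut-solvable" examples
# contribute little gradient to the main model.
import torch
import torch.nn.functional as F

def poe_loss(main_logits, weak_logits, labels):
    """Cross-entropy over the product of experts; gradients flow only into
    the main model because the weak expert is detached."""
    log_probs = F.log_softmax(main_logits, dim=-1) \
              + F.log_softmax(weak_logits, dim=-1).detach()
    return F.cross_entropy(log_probs, labels)

# Toy usage with random "logits" for a 3-class task.
main = torch.randn(8, 3, requires_grad=True)
weak = torch.randn(8, 3)
loss = poe_loss(main, weak, torch.randint(0, 3, (8,)))
loss.backward()
```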
- Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias [45.956112337250275]
We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior.
We apply this methodology to analyze gender bias in pre-trained Transformer language models.
Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
arXiv Detail & Related papers (2020-04-26T01:53:03Z)
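For reference, the quantities that the mediation-analysis papers above estimate can be written in standard causal notation, with x the original prompt, x' its gender-swapped version, Z the mediator (a neuron, layer, or attention head), and Y the model output; the papers measure variants of these, e.g., as relative changes in a candidate word's probability:

```latex
\begin{align*}
\mathrm{TE}  &= \mathbb{E}\!\left[Y_{x'}\right] - \mathbb{E}\!\left[Y_{x}\right]
  &&\text{total effect of the gender swap}\\
\mathrm{NDE} &= \mathbb{E}\!\left[Y_{x',\,Z_{x}}\right] - \mathbb{E}\!\left[Y_{x}\right]
  &&\text{swap the input, hold the mediator at its original value}\\
\mathrm{NIE} &= \mathbb{E}\!\left[Y_{x,\,Z_{x'}}\right] - \mathbb{E}\!\left[Y_{x}\right]
  &&\text{keep the input, give the mediator its swapped value}
\end{align*}
```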