Causal Mediation Analysis for Interpreting Neural NLP: The Case of
Gender Bias
- URL: http://arxiv.org/abs/2004.12265v2
- Date: Sun, 22 Nov 2020 07:58:08 GMT
- Title: Causal Mediation Analysis for Interpreting Neural NLP: The Case of
Gender Bias
- Authors: Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel
Nevo, Simas Sakenis, Jason Huang, Yaron Singer, Stuart Shieber
- Abstract summary: We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior.
We apply this methodology to analyze gender bias in pre-trained Transformer language models.
Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
- Score: 45.956112337250275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common methods for interpreting neural models in natural language processing
typically examine either their structure or their behavior, but not both. We
propose a methodology grounded in the theory of causal mediation analysis for
interpreting which parts of a model are causally implicated in its behavior. It
enables us to analyze the mechanisms by which information flows from input to
output through various model components, known as mediators. We apply this
methodology to analyze gender bias in pre-trained Transformer language models.
We study the role of individual neurons and attention heads in mediating gender
bias across three datasets designed to gauge a model's sensitivity to gender
bias. Our mediation analysis reveals that gender bias effects are (i) sparse,
concentrated in a small part of the network; (ii) synergistic, amplified or
repressed by different components; and (iii) decomposable into effects flowing
directly from the input and indirectly through the mediators.
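
To make the decomposition into direct and indirect effects concrete, below is a minimal sketch of a neuron-level mediation probe in the spirit of the abstract, written against GPT-2 from the Hugging Face `transformers` library. The bias measure y = p(he)/p(she), the prompt pair, and the layer/neuron chosen as the candidate mediator are illustrative assumptions, not the authors' exact setup; the total effect is computed as y_set-gender / y_null - 1, and the indirect effect re-runs the null prompt while holding the mediator at its set-gender value.

```python
# A minimal sketch, not the authors' released code: a neuron-level mediation
# probe on GPT-2 with Hugging Face `transformers`. The prompts, the bias
# measure y = p(he)/p(she), and the LAYER/NEURON indices are illustrative
# assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

HE = tokenizer.encode(" he")[0]
SHE = tokenizer.encode(" she")[0]

def bias_measure(prompt: str) -> float:
    """y(prompt) = p(he | prompt) / p(she | prompt) for the next token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return (probs[HE] / probs[SHE]).item()

# Input under the null (ambiguous) condition and under a set-gender intervention.
u_null = "The nurse said that"
u_set = "The man said that"

y_null = bias_measure(u_null)
y_set = bias_measure(u_set)
total_effect = y_set / y_null - 1.0

# Indirect effect of one candidate mediator neuron: feed the null prompt, but
# force that neuron to the value it takes under the set-gender prompt.
# (Simplification: intervene at the last token; layer/neuron are hypothetical.)
LAYER, NEURON = 5, 300
block = model.transformer.h[LAYER]
captured = {}

def capture(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the hidden state (batch, seq, dim).
    captured["val"] = output[0][0, -1, NEURON].item()

def patch(module, inputs, output):
    output[0][0, -1, NEURON] = captured["val"]
    return output

handle = block.register_forward_hook(capture)
bias_measure(u_set)                  # record the mediator's set-gender value
handle.remove()

handle = block.register_forward_hook(patch)
y_mediated = bias_measure(u_null)    # null input, mediator held at set-gender value
handle.remove()

indirect_effect = y_mediated / y_null - 1.0
print(f"total effect: {total_effect:.3f}  indirect effect: {indirect_effect:.3f}")
```

Sweeping this patching loop over layers and neurons, and aggregating the resulting indirect effects, is what surfaces the sparsity and decomposability properties described above; attention heads can be probed the same way by patching their outputs instead of hidden-state neurons.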
Related papers
- Locating and Mitigating Gender Bias in Large Language Models [40.78150878350479]
Large language models (LLMs) are pre-trained on extensive corpora to learn facts and aspects of human cognition, which encode human preferences.
This process can inadvertently lead to these models acquiring societal biases and prevalent stereotypes.
We propose LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns.
arXiv Detail & Related papers (2024-03-21T13:57:43Z) - Identifying and Adapting Transformer-Components Responsible for Gender
Bias in an English Language Model [1.6343144783668118]
Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias.
We study three methods for identifying causal relations between LM components and particular outputs.
We apply the methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation.
arXiv Detail & Related papers (2023-10-19T09:39:21Z) - The Birth of Bias: A case study on the evolution of gender bias in an
English language model [1.6344851071810076]
We use a relatively small language model with an LSTM architecture, trained on an English Wikipedia corpus.
We find that the representation of gender is dynamic and identify different phases during training.
We show that gender information is represented increasingly locally in the input embeddings of the model.
arXiv Detail & Related papers (2022-07-21T00:59:04Z) - What Changed? Investigating Debiasing Methods using Causal Mediation
Analysis [1.3225884668783203]
Using causal mediation analysis, we decompose the internal mechanisms by which debiasing methods change a language model's treatment of gender.
Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics.
arXiv Detail & Related papers (2022-06-01T18:26:24Z) - Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real-world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z) - Word Embeddings via Causal Inference: Gender Bias Reducing and Semantic
Information Preserving [3.114945725130788]
We propose a novel methodology that leverages a causal inference framework to effectively remove gender bias.
Our comprehensive experiments show that the proposed method achieves state-of-the-art results in gender-debiasing tasks.
arXiv Detail & Related papers (2021-12-09T19:57:22Z) - Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z) - Analyzing the Source and Target Contributions to Predictions in Neural
Machine Translation [97.22768624862111]
We analyze NMT models by explicitly evaluating the relative contributions of source and target to the generation process.
We find that models trained with more data tend to rely more on source information and to have sharper token contributions.
arXiv Detail & Related papers (2020-10-21T11:37:27Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z) - InsideBias: Measuring Bias in Deep Networks and Application to Face
Gender Biometrics [73.85525896663371]
This work explores the biases in learning processes based on deep neural network architectures.
We employ two gender detection models based on popular deep neural networks.
We propose InsideBias, a novel method to detect biased models.
arXiv Detail & Related papers (2020-04-14T15:20:50Z)