Identifying and Adapting Transformer-Components Responsible for Gender
Bias in an English Language Model
- URL: http://arxiv.org/abs/2310.12611v1
- Date: Thu, 19 Oct 2023 09:39:21 GMT
- Title: Identifying and Adapting Transformer-Components Responsible for Gender
Bias in an English Language Model
- Authors: Abhijith Chintam, Rahel Beloch, Willem Zuidema, Michael Hanna and
Oskar van der Wal
- Abstract summary: Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias.
We study three methods for identifying causal relations between LM components and particular output.
We apply the methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation.
- Score: 1.6343144783668118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) exhibit and amplify many types of undesirable biases
learned from the training data, including gender bias. However, we lack tools
for effectively and efficiently changing this behavior without hurting general
language modeling performance. In this paper, we study three methods for
identifying causal relations between LM components and particular output:
causal mediation analysis, automated circuit discovery and our novel, efficient
method called DiffMask+ based on differential masking. We apply the methods to
GPT-2 small and the problem of gender bias, and use the discovered sets of
components to perform parameter-efficient fine-tuning for bias mitigation. Our
results show significant overlap in the identified components (despite huge
differences in the computational requirements of the methods) as well as
success in mitigating gender bias, with less damage to general language
modeling compared to full model fine-tuning. However, our work also underscores
the difficulty of defining and measuring bias, and the sensitivity of causal
discovery procedures to dataset choice. We hope our work can contribute to more
attention for dataset development, and lead to more effective mitigation
strategies for other types of bias.
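To make the attribution step concrete: all three methods ask how much an individual component contributes causally to a gender-skewed prediction. The sketch below is a minimal activation-patching probe in that spirit, not the authors' implementation; the prompt pair, the patched layer, and the he/she log-odds metric are illustrative assumptions.

```python
# Illustrative activation-patching probe on GPT-2 small (not the paper's exact pipeline).
# Cache one attention block's output on a counterfactual prompt, patch it into the run
# on the original prompt, and compare the " he" vs " she" log-odds.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

HE, SHE = tok.encode(" he")[0], tok.encode(" she")[0]

def he_she_logodds(logits):
    # log p(" he") - log p(" she") at the last position: one simple bias proxy
    logp = torch.log_softmax(logits[0, -1], dim=-1)
    return (logp[HE] - logp[SHE]).item()

# Prompts chosen to tokenize to the same length; only the profession differs.
base = tok("The nurse said that", return_tensors="pt")
counter = tok("The doctor said that", return_tensors="pt")

layer = 9    # hypothetical single component; the paper scores heads and MLPs across all layers
cache = {}

def save_hook(module, inputs, output):
    cache["attn"] = output[0].detach()      # hidden states produced by this attention block

def patch_hook(module, inputs, output):
    return (cache["attn"],) + output[1:]    # swap in the cached counterfactual activation

attn = model.transformer.h[layer].attn
with torch.no_grad():
    handle = attn.register_forward_hook(save_hook)
    model(**counter)                        # 1) cache the activation on the counterfactual prompt
    handle.remove()

    clean = model(**base).logits            # 2) unpatched baseline

    handle = attn.register_forward_hook(patch_hook)
    patched = model(**base).logits          # 3) same prompt, one component overwritten
    handle.remove()

print(f"clean   he/she log-odds: {he_she_logodds(clean):+.3f}")
print(f"patched he/she log-odds: {he_she_logodds(patched):+.3f}")
```

Components whose patched activations shift such a metric strongly are the natural candidates for the parameter-efficient fine-tuning step described in the abstract.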
Related papers
- Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA).
Our method significantly outperforms existing approaches, achieving an average AUC of 91.64%.
DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z)
- REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning [18.064064773660174]
We introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning.
By training a simple model on top of the word probability distribution of an LM, our bias reinforcement learning method enables model debiasing without human annotations.
Experiments conducted on a wide range of models, including several LMs, show that our method significantly reduces stereotypical biases while preserving LM performance.
arXiv Detail & Related papers (2024-08-18T14:08:31Z)
- Locating and Mitigating Gender Bias in Large Language Models [40.78150878350479]
Large language models (LLMs) are pre-trained on extensive corpora to learn facts and patterns of human cognition, which contain human preferences.
This process can inadvertently lead to these models acquiring biases and stereotypes prevalent in society.
We propose the LSDM (Least Square Debias Method), a knowledge-editing based method for mitigating gender bias in occupational pronouns.
arXiv Detail & Related papers (2024-03-21T13:57:43Z)
- Debiasing Algorithm through Model Adaptation [5.482673673984126]
We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey bias.
Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers.
Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks.
arXiv Detail & Related papers (2023-10-29T05:50:03Z)
- Fast Model Debias with Machine Unlearning [54.32026474971696]
Deep neural networks might behave in a biased manner in many real-world scenarios.
Existing debiasing methods suffer from high costs in bias labeling or model re-training.
We propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate and remove biases.
arXiv Detail & Related papers (2023-10-19T08:10:57Z)
- An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models [55.14405248920852]
We conduct experiments with prefix tuning, prompt tuning, and adapter tuning on different language models and bias types to evaluate their debiasing performance.
We find that the parameter-efficient methods are effective in mitigating gender bias, where adapter tuning is consistently the most effective.
We also find that prompt tuning is more suitable for GPT-2 than for BERT, and that the methods are less effective when it comes to racial and religious bias. (A minimal bottleneck-adapter sketch in this spirit appears after this list.)
arXiv Detail & Related papers (2023-06-06T23:56:18Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
- Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data [1.76179873429447]
We propose a data preprocessing technique that can detect instances exhibiting a specific kind of bias that should be removed from the dataset before training.
In particular, we claim that in problem settings where instances exist with similar features but different labels caused by variation in protected attributes, an inherent bias gets induced in the dataset.
arXiv Detail & Related papers (2022-10-24T13:04:07Z)
- The Birth of Bias: A case study on the evolution of gender bias in an English language model [1.6344851071810076]
We use a relatively small language model with the LSTM architecture, trained on an English Wikipedia corpus.
We find that the representation of gender is dynamic and identify different phases during training.
We show that gender information is represented increasingly locally in the input embeddings of the model.
arXiv Detail & Related papers (2022-07-21T00:59:04Z)
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
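The "Balancing out Bias" entry above relies on instance reweighting. As a generic illustration (not that paper's exact recipe), the sketch below down-weights examples whose (author attribute, label) combination is over-represented and applies the weights in a per-example weighted loss; the weighting rule, field names, and model interface are assumptions.

```python
# Generic instance-reweighting sketch: per-example weights scale the loss so that
# over-represented (author-attribute, label) combinations contribute less.
from collections import Counter
import torch
import torch.nn.functional as F

def inverse_frequency_weights(groups, labels):
    """Weight each example by 1 / count(group, label), normalised to mean 1."""
    counts = Counter(zip(groups, labels))
    w = torch.tensor([1.0 / counts[(g, y)] for g, y in zip(groups, labels)])
    return w * len(w) / w.sum()

def weighted_step(model, optimizer, inputs, labels, weights):
    """One training step with a per-example weighted cross-entropy loss."""
    logits = model(inputs)                                   # (batch, num_classes)
    per_example = F.cross_entropy(logits, labels, reduction="none")
    loss = (weights * per_example).mean()                    # reweighted objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```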
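Several of the entries above, like the main paper's fine-tuning step, update only a small number of parameters while the bulk of the LM stays frozen. The sketch below is a minimal bottleneck adapter in plain PyTorch wrapped around one GPT-2 block, loosely in the spirit of the adapter tuning evaluated in "An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models"; the bottleneck width, insertion layer, and wrapper design are illustrative assumptions, not that paper's configuration.

```python
# Minimal bottleneck-adapter sketch (assumptions only, not a cited paper's setup):
# a small down/up projection with a residual connection wraps a frozen GPT-2 block,
# and only the adapter parameters remain trainable.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):   # 768 = GPT-2 small hidden size
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        nn.init.zeros_(self.up.weight)                     # start as identity: behaviour unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

class AdaptedBlock(nn.Module):
    """Wraps one transformer block and passes its hidden states through an adapter."""
    def __init__(self, block, hidden_size=768):
        super().__init__()
        self.block = block
        self.adapter = Adapter(hidden_size)

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.block(hidden_states, *args, **kwargs)
        return (self.adapter(outputs[0]),) + outputs[1:]

model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                                # freeze the whole LM

layer = 9                                                  # hypothetical insertion point
model.transformer.h[layer] = AdaptedBlock(model.transformer.h[layer])

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable adapter parameters")
```

Training would then optimize only the adapter parameters on a debiasing objective, leaving the original weights untouched.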