Self-Debiasing Large Language Models: Zero-Shot Recognition and
Reduction of Stereotypes
- URL: http://arxiv.org/abs/2402.01981v1
- Date: Sat, 3 Feb 2024 01:40:11 GMT
- Title: Self-Debiasing Large Language Models: Zero-Shot Recognition and
Reduction of Stereotypes
- Authors: Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Tong
Yu, Hanieh Deilamsalehy, Ruiyi Zhang, Sungchul Kim, Franck Dernoncourt
- Abstract summary: We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
- Score: 73.12947922129261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown remarkable advances in language
generation and understanding but are also prone to exhibiting harmful social
biases. While recognition of these behaviors has generated an abundance of bias
mitigation techniques, most require modifications to the training data, model
parameters, or decoding strategy, which may be infeasible without access to a
trainable model. In this work, we leverage the zero-shot capabilities of LLMs
to reduce stereotyping in a technique we introduce as zero-shot self-debiasing.
With two approaches, self-debiasing via explanation and self-debiasing via
reprompting, we show that self-debiasing can significantly reduce the degree of
stereotyping across nine different social groups while relying only on the LLM
itself and a simple prompt, with explanations correctly identifying invalid
assumptions and reprompting delivering the greatest reductions in bias. We hope
this work opens inquiry into other zero-shot techniques for bias mitigation.
Related papers
- Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory [29.201402717025335]
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information.
We have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory.
arXiv Detail & Related papers (2024-08-20T07:40:12Z) - REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning [18.064064773660174]
We introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning.
By training a simple model on top of the word probability distribution of a LM, our bias reinforcement learning method enables model debiasing without human annotations.
Experiments conducted on a wide range of models, including several LMs, show that our method significantly reduces stereotypical biases while preserving LMs performance.
arXiv Detail & Related papers (2024-08-18T14:08:31Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation [18.150899267807965]
We study an unlearning-based approach to debiasing in large language models (LLMs)
We propose a mask language modeling unlearning technique, which unlearns the harmful part of the text.
Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining the language modeling abilities.
arXiv Detail & Related papers (2024-07-24T02:37:42Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose sc Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings.
CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method.
Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs)
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.