Muslim-Violence Bias Persists in Debiased GPT Models
- URL: http://arxiv.org/abs/2310.18368v2
- Date: Sat, 9 Dec 2023 18:11:06 GMT
- Title: Muslim-Violence Bias Persists in Debiased GPT Models
- Authors: Babak Hemmatian, Razan Baltaji, Lav R. Varshney
- Abstract summary: Using common names associated with the religions in prompts increases several-fold the rate of violent completions.
Our results show the need for continual de-biasing of models.
- Score: 18.905135223612046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Abid et al. (2021) showed a tendency in GPT-3 to generate mostly violent
completions when prompted about Muslims, compared with other religions. Two
pre-registered replication attempts found few violent completions and only a
weak anti-Muslim bias in the more recent InstructGPT, fine-tuned to eliminate
biased and toxic outputs. However, more pre-registered experiments showed that
using common names associated with the religions in prompts increases
several-fold the rate of violent completions, revealing a significant
second-order anti-Muslim bias. ChatGPT showed a bias many times stronger
regardless of prompt format, suggesting that the effects of debiasing were
reduced with continued model development. Our content analysis revealed
religion-specific themes containing offensive stereotypes across all
experiments. Our results show the need for continual de-biasing of models in
ways that address both explicit and higher-order associations.
Related papers
- Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection [5.800102484016876]
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content.
This paper presents a systematic framework grounded in social psychology theories to investigate explicit and implicit biases in LLMs.
arXiv Detail & Related papers (2025-01-04T14:08:52Z) - How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs.
Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z) - Bias Amplification: Language Models as Increasingly Biased Media [13.556583047930065]
We propose a theoretical framework, defining the necessary and sufficient conditions for bias amplification.
We conduct experiments with GPT-2 to empirically demonstrate bias amplification.
We find that both Preservation and Accumulation effectively mitigate bias amplification and model collapse.
arXiv Detail & Related papers (2024-10-19T22:53:27Z) - From Lists to Emojis: How Format Bias Affects Model Alignment [67.08430328350327]
We study format biases in reinforcement learning from human feedback.
Many widely-used preference models, including human evaluators, exhibit strong biases towards specific format patterns.
We show that with a small amount of biased data, we can inject significant bias into the reward model.
arXiv Detail & Related papers (2024-09-18T05:13:18Z) - Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z) - What's in a Name? Auditing Large Language Models for Race and Gender
Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4.
We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language
Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - Debiased Large Language Models Still Associate Muslims with Uniquely
Violent Acts [24.633323508534254]
Using common names associated with the religions in prompts yields a highly significant increase in violent completions.
Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions.
Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.
arXiv Detail & Related papers (2022-08-08T20:59:16Z) - Reducing the Vision and Language Bias for Temporal Sentence Grounding [22.571577672704716]
We propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both vision and language modalities.
We demonstrate its effectiveness by achieving the state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2022-07-27T11:18:45Z) - The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings.
We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z) - Persistent Anti-Muslim Bias in Large Language Models [13.984800635696566]
GPT-3, a state-of-the-art contextual language model, captures persistent Muslim-violence bias.
We probe GPT-3 in various ways, including prompt completion, analogical reasoning, and story generation.
For instance, "Muslim" is analogized to "terrorist" in 23% of test cases, while "Jewish" is mapped to "money" in 5% of test cases.
arXiv Detail & Related papers (2021-01-14T18:41:55Z) - "Thy algorithm shalt not bear false witness": An Evaluation of
Multiclass Debiasing Methods on Word Embeddings [3.0204693431381515]
The paper investigates the state-of-the-art multiclass debiasing techniques: Hard debiasing, SoftWEAT debiasing and Conceptor debiasing.
It evaluates their performance when removing religious bias on a common basis by quantifying bias removal via the Word Embedding Association Test (WEAT), Mean Average Cosine Similarity (MAC) and the Relative Negative Sentiment Bias (RNSB)
arXiv Detail & Related papers (2020-10-30T12:49:39Z) - Towards Controllable Biases in Language Generation [87.89632038677912]
We develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups.
We analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics.
arXiv Detail & Related papers (2020-05-01T08:25:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.