Muslim-Violence Bias Persists in Debiased GPT Models
- URL: http://arxiv.org/abs/2310.18368v2
- Date: Sat, 9 Dec 2023 18:11:06 GMT
- Title: Muslim-Violence Bias Persists in Debiased GPT Models
- Authors: Babak Hemmatian, Razan Baltaji, Lav R. Varshney
- Abstract summary: Using common names associated with the religions in prompts increases several-fold the rate of violent completions.
Our results show the need for continual de-biasing of models.
- Score: 18.905135223612046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Abid et al. (2021) showed a tendency in GPT-3 to generate mostly violent
completions when prompted about Muslims, compared with other religions. Two
pre-registered replication attempts found few violent completions and only a
weak anti-Muslim bias in the more recent InstructGPT, fine-tuned to eliminate
biased and toxic outputs. However, more pre-registered experiments showed that
using common names associated with the religions in prompts increases
several-fold the rate of violent completions, revealing a significant
second-order anti-Muslim bias. ChatGPT showed a bias many times stronger
regardless of prompt format, suggesting that the effects of debiasing were
reduced with continued model development. Our content analysis revealed
religion-specific themes containing offensive stereotypes across all
experiments. Our results show the need for continual de-biasing of models in
ways that address both explicit and higher-order associations.
Related papers
- Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z) - What's in a Name? Auditing Large Language Models for Race and Gender
Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4.
We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language
Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs [3.5342505775640247]
We present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate.
The demo will answer this question using a model fine-tuned on text representing each of the selected biases.
To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics.
arXiv Detail & Related papers (2023-09-07T17:41:01Z) - Discovering and Mitigating Visual Biases through Keyword Explanation [66.71792624377069]
We propose the Bias-to-Text (B2T) framework, which interprets visual biases as keywords.
B2T can identify known biases, such as gender bias in CelebA, background bias in Waterbirds, and distribution shifts in ImageNet-R/C.
B2T uncovers novel biases in larger datasets, such as Dollar Street and ImageNet.
arXiv Detail & Related papers (2023-01-26T13:58:46Z) - Debiased Large Language Models Still Associate Muslims with Uniquely
Violent Acts [24.633323508534254]
Using common names associated with the religions in prompts yields a highly significant increase in violent completions.
Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions.
Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.
arXiv Detail & Related papers (2022-08-08T20:59:16Z) - Reducing the Vision and Language Bias for Temporal Sentence Grounding [22.571577672704716]
We propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both vision and language modalities.
We demonstrate its effectiveness by achieving the state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2022-07-27T11:18:45Z) - NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias [54.89737992911079]
We propose a new task, a neutral summary generation from multiple news headlines of the varying political spectrum.
One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content, but also politically biased content.
arXiv Detail & Related papers (2022-04-11T07:06:01Z) - Persistent Anti-Muslim Bias in Large Language Models [13.984800635696566]
GPT-3, a state-of-the-art contextual language model, captures persistent Muslim-violence bias.
We probe GPT-3 in various ways, including prompt completion, analogical reasoning, and story generation.
For instance, "Muslim" is analogized to "terrorist" in 23% of test cases, while "Jewish" is mapped to "money" in 5% of test cases.
arXiv Detail & Related papers (2021-01-14T18:41:55Z) - "Thy algorithm shalt not bear false witness": An Evaluation of
Multiclass Debiasing Methods on Word Embeddings [3.0204693431381515]
The paper investigates the state-of-the-art multiclass debiasing techniques: Hard debiasing, SoftWEAT debiasing and Conceptor debiasing.
It evaluates their performance when removing religious bias on a common basis by quantifying bias removal via the Word Embedding Association Test (WEAT), Mean Average Cosine Similarity (MAC) and the Relative Negative Sentiment Bias (RNSB)
arXiv Detail & Related papers (2020-10-30T12:49:39Z) - Towards Controllable Biases in Language Generation [87.89632038677912]
We develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups.
We analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics.
arXiv Detail & Related papers (2020-05-01T08:25:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.