Debiased Large Language Models Still Associate Muslims with Uniquely
Violent Acts
- URL: http://arxiv.org/abs/2208.04417v2
- Date: Wed, 10 Aug 2022 13:49:54 GMT
- Title: Debiased Large Language Models Still Associate Muslims with Uniquely
Violent Acts
- Authors: Babak Hemmatian, Lav R. Varshney
- Abstract summary: Using common names associated with the religions in prompts yields a highly significant increase in violent completions.
Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions.
Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.
- Score: 24.633323508534254
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work demonstrates a bias in the GPT-3 model towards generating violent
text completions when prompted about Muslims, compared with Christians and
Hindus. Two pre-registered replication attempts, one exact and one approximate,
found only the weakest bias in the more recent Instruct Series version of
GPT-3, fine-tuned to eliminate biased and toxic outputs. Few violent
completions were observed. Additional pre-registered experiments, however,
showed that using common names associated with the religions in prompts yields
a highly significant increase in violent completions, also revealing a stronger
second-order bias against Muslims. Names of Muslim celebrities from non-violent
domains resulted in relatively fewer violent completions, suggesting that
access to individualized information can steer the model away from using
stereotypes. Nonetheless, content analysis revealed religion-specific violent
themes containing highly offensive ideas regardless of prompt format. Our
results show the need for additional debiasing of large language models to
address higher-order schemas and associations.
Related papers
- What's in a Name? Auditing Large Language Models for Race and Gender
Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4.
We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z) - What Do Llamas Really Think? Revealing Preference Biases in Language
Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations.
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Muslim-Violence Bias Persists in Debiased GPT Models [18.905135223612046]
Using common names associated with the religions in prompts increases several-fold the rate of violent completions.
Our results show the need for continual de-biasing of models.
arXiv Detail & Related papers (2023-10-25T19:39:58Z) - Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models [11.330830398772582]
We present a novel framework dubbed textittoxicity rabbit hole that iteratively elicits toxic content from a wide suite of large language models.
We present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.
arXiv Detail & Related papers (2023-09-08T03:59:02Z) - PACO: Provocation Involving Action, Culture, and Oppression [13.70482307997736]
In India, people identify with a particular group based on certain attributes such as religion.
The same religious groups are often provoked against each other.
Previous studies show the role of provocation in increasing tensions between India's two prominent religious groups: Hindus and Muslims.
arXiv Detail & Related papers (2023-03-19T04:39:36Z) - Discovering and Mitigating Visual Biases through Keyword Explanation [66.71792624377069]
We propose the Bias-to-Text (B2T) framework, which interprets visual biases as keywords.
B2T can identify known biases, such as gender bias in CelebA, background bias in Waterbirds, and distribution shifts in ImageNet-R/C.
B2T uncovers novel biases in larger datasets, such as Dollar Street and ImageNet.
arXiv Detail & Related papers (2023-01-26T13:58:46Z) - Exploring Hate Speech Detection with HateXplain and BERT [2.673732496490253]
Hate Speech takes many forms to target communities with derogatory comments, and takes humanity a step back in societal progress.
HateXplain is a recently published and first dataset to use annotated spans in the form of rationales, along with speech classification categories and targeted communities.
We tune BERT to perform this task in the form of rationales and class prediction, and compare our performance on different metrics spanning across accuracy, explainability and bias.
arXiv Detail & Related papers (2022-08-09T01:32:44Z) - The World of an Octopus: How Reporting Bias Influences a Language
Model's Perception of Color [73.70233477125781]
We show that reporting bias negatively impacts and inherently limits text-only training.
We then demonstrate that multimodal models can leverage their visual training to mitigate these effects.
arXiv Detail & Related papers (2021-10-15T16:28:17Z) - Persistent Anti-Muslim Bias in Large Language Models [13.984800635696566]
GPT-3, a state-of-the-art contextual language model, captures persistent Muslim-violence bias.
We probe GPT-3 in various ways, including prompt completion, analogical reasoning, and story generation.
For instance, "Muslim" is analogized to "terrorist" in 23% of test cases, while "Jewish" is mapped to "money" in 5% of test cases.
arXiv Detail & Related papers (2021-01-14T18:41:55Z) - Improving Robustness by Augmenting Training Sentences with
Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective.
We propose to augment the input sentences in the training data with their corresponding predicate-argument structures.
We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.