Related papers: Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

URL: http://arxiv.org/abs/2208.04417v2
Date: Wed, 10 Aug 2022 13:49:54 GMT
Title: Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts
Authors: Babak Hemmatian, Lav R. Varshney
Abstract summary: Using common names associated with the religions in prompts yields a highly significant increase in violent completions. Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions. Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.
Score: 24.633323508534254
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work demonstrates a bias in the GPT-3 model towards generating violent text completions when prompted about Muslims, compared with Christians and Hindus. Two pre-registered replication attempts, one exact and one approximate, found only the weakest bias in the more recent Instruct Series version of GPT-3, fine-tuned to eliminate biased and toxic outputs. Few violent completions were observed. Additional pre-registered experiments, however, showed that using common names associated with the religions in prompts yields a highly significant increase in violent completions, also revealing a stronger second-order bias against Muslims. Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions, suggesting that access to individualized information can steer the model away from using stereotypes. Nonetheless, content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.

Related papers

Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption [51.98089842456886]
We show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method.
arXiv Detail & Related papers (2025-04-29T13:50:05Z)
Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets. They learn unintended correlations between semantic concepts and unrelated visual signals. Prior work has weaponized these correlations as an attack vector to manipulate model predictions. We introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
arXiv Detail & Related papers (2025-03-17T18:59:29Z)
Religious Bias Landscape in Language and Text-to-Image Models: Analysis, Detection, and Debiasing Strategies [16.177734242454193]
The widespread adoption of language models highlights the need for critical examinations of their inherent biases. This study systematically investigates religious bias in both language models and text-to-image generation models.
arXiv Detail & Related papers (2025-01-14T21:10:08Z)
Bias Amplification: Language Models as Increasingly Biased Media [13.556583047930065]
We propose a theoretical framework, defining the necessary and sufficient conditions for bias amplification. We conduct experiments with GPT-2 to empirically demonstrate bias amplification. We find that both Preservation and Accumulation effectively mitigate bias amplification and model collapse.
arXiv Detail & Related papers (2024-10-19T22:53:27Z)
From Lists to Emojis: How Format Bias Affects Model Alignment [67.08430328350327]
We study format biases in reinforcement learning from human feedback. Many widely-used preference models, including human evaluators, exhibit strong biases towards specific format patterns. We show that with a small amount of biased data, we can inject significant bias into the reward model.
arXiv Detail & Related papers (2024-09-18T05:13:18Z)
Exploring Bengali Religious Dialect Biases in Large Language Models with Evaluation Perspectives [5.648318448953635]
Large Language Models (LLM) can produce output that contains stereotypes and biases. We explore bias from a religious perspective in Bengali, focusing specifically on two main religious dialects: Hindu and Muslim-majority dialects.
arXiv Detail & Related papers (2024-07-25T20:19:29Z)
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content.<n>We introduce the Language Agency Bias Evaluation benchmark, which comprehensively evaluates biases in LLMs.<n>Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
What's in a Name? Auditing Large Language Models for Race and Gender Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4. We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z)
What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
Muslim-Violence Bias Persists in Debiased GPT Models [18.905135223612046]
Using common names associated with the religions in prompts increases several-fold the rate of violent completions. Our results show the need for continual de-biasing of models.
arXiv Detail & Related papers (2023-10-25T19:39:58Z)
Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models [11.330830398772582]
We present a novel framework dubbed textittoxicity rabbit hole that iteratively elicits toxic content from a wide suite of large language models. We present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.
arXiv Detail & Related papers (2023-09-08T03:59:02Z)
Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
Discovering and Mitigating Visual Biases through Keyword Explanation [66.71792624377069]
We propose the Bias-to-Text (B2T) framework, which interprets visual biases as keywords. B2T can identify known biases, such as gender bias in CelebA, background bias in Waterbirds, and distribution shifts in ImageNet-R/C. B2T uncovers novel biases in larger datasets, such as Dollar Street and ImageNet.
arXiv Detail & Related papers (2023-01-26T13:58:46Z)
Persistent Anti-Muslim Bias in Large Language Models [13.984800635696566]
GPT-3, a state-of-the-art contextual language model, captures persistent Muslim-violence bias. We probe GPT-3 in various ways, including prompt completion, analogical reasoning, and story generation. For instance, "Muslim" is analogized to "terrorist" in 23% of test cases, while "Jewish" is mapped to "money" in 5% of test cases.
arXiv Detail & Related papers (2021-01-14T18:41:55Z)
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective. We propose to augment the input sentences in the training data with their corresponding predicate-argument structures. We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.