Related papers: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

URL: http://arxiv.org/abs/2510.09871v1
Date: Fri, 10 Oct 2025 21:09:01 GMT
Title: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner,
Abstract summary: We introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which Large language models depart from normative or ethical behavior.<n>CoBia creates a constructed conversation where the model utters a biased claim about a social group.<n>We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions.
Score: 10.340166874690578
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.

Related papers

Language Models Surface the Unwritten Code of Science and Society [1.6245906033871593]
This paper calls on the research community to investigate how human biases are inherited by large language models (LLMs)<n>We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review.
arXiv Detail & Related papers (2025-05-25T02:28:40Z)
DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs [1.89915151018241]
We argue that implicit bias in Large Language Models (LLMs) is not only an ethical, but also a technical issue.<n>We developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness)
arXiv Detail & Related papers (2025-05-15T06:53:37Z)
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale.<n>No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective.
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.<n>In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks.<n>They can be easily misled by unfaithful arguments during conversations, even when their original statements are correct.<n>We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z)
Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings [18.318232025355524]
We extend an existing dataset BBQ to Open-BBQ to evaluate the social bias of LLMs in open-ended settings.<n>We develop an evaluation process to detect biases from open-ended content by labeling sentences and paragraphs.<n>In order to solve this issue, we propose Composite Prompting, an In-context Learning (ICL) method combining structured examples with explicit chain-of-thought reasoning.
arXiv Detail & Related papers (2024-12-09T01:29:47Z)
Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory [29.201402717025335]
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. We have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory.
arXiv Detail & Related papers (2024-08-20T07:40:12Z)
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics.<n>Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching.<n>Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z)
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs) In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation. We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process. We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.