Characterizing Selective Refusal Bias in Large Language Models
- URL: http://arxiv.org/abs/2510.27087v1
- Date: Fri, 31 Oct 2025 01:17:28 GMT
- Title: Characterizing Selective Refusal Bias in Large Language Models
- Authors: Adel Khorramrouz, Sharon Levy
- Abstract summary: Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. LLMs may refuse to generate harmful content targeting some demographic groups and not others. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes.
- Score: 10.194832877178701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.
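The measurement described in the abstract can be illustrated with a small, self-contained sketch. The code below is not the authors' implementation; the record layout, the keyword-based `is_refusal` heuristic, and the pairwise treatment of intersectional groups are illustrative assumptions only.

```python
# Hypothetical sketch of per-group refusal-rate measurement (not the paper's code).
from collections import defaultdict
from itertools import combinations

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Naive keyword heuristic; the paper's typing of LLM responses is richer."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(records):
    """records: dicts like {"attributes": {"gender": "...", "religion": "..."},
    "response": "..."}. Returns the refusal rate per single attribute value and
    per pair of attribute values (a simple stand-in for intersectional groups)."""
    counts = defaultdict(lambda: [0, 0])  # group key -> [refusal count, total]
    for rec in records:
        refused = is_refusal(rec["response"])
        values = sorted(rec["attributes"].items())
        groups = [(v,) for v in values] + list(combinations(values, 2))
        for group in groups:
            counts[group][1] += 1
            counts[group][0] += int(refused)
    return {group: r / n for group, (r, n) in counts.items() if n}
```

Under this sketch, a large gap between two groups' rates (together with differences in the average length of refusal responses) corresponds to the kind of disparity the abstract reports as selective refusal bias.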
Related papers
- Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification [7.696781721646013]
We investigate false refusal behavior in hate speech detoxification. We show that large language models (LLMs) disproportionately refuse inputs with higher semantic toxicity. We propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back.
arXiv Detail & Related papers (2026-01-13T15:45:31Z)
- Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- Refusal Direction is Universal Across Safety-Aligned Languages [66.64709923081745]
In this paper, we investigate the refusal behavior in large language models (LLMs) across 14 languages using PolyRefuse. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks.
arXiv Detail & Related papers (2025-05-22T21:54:46Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective.
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
- The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models [91.86718720024825]
We center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning. We conclude with recommendations tailored to DPO and broader alignment practices.
arXiv Detail & Related papers (2024-11-06T06:50:50Z)
- Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics. Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z)
- LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models [19.18522268167047]
Large language models (LLMs) have achieved impressive performance on various natural language generation tasks.
However, they can generate negative and harmful content that is biased against certain demographic groups.
We propose LIDAO, a framework that provably debiases (large) language models while better preserving fluency.
arXiv Detail & Related papers (2024-06-01T20:12:54Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs, lacking alignment, cannot be readily misused.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency [33.17945055081054]
Current alignment evaluation metrics often overlook stereotypes' randomness caused by large language models' inconsistent generative behavior. We propose the Bias-Volatility Framework (BVF), which estimates the probability distribution of stereotypes in LLM outputs.
arXiv Detail & Related papers (2024-02-23T18:15:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.