MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
- URL: http://arxiv.org/abs/2601.04389v1
- Date: Wed, 07 Jan 2026 20:53:18 GMT
- Title: MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
- Authors: Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
- Abstract summary: MiJaBench is an adversarial benchmark comprising 44,000 prompts across 16 minority groups. Defense rates fluctuate by up to 33% within the same model solely based on the target group. All datasets and scripts are released on GitHub to encourage research into granular demographic alignment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not instill a general principle of non-discrimination but instead reinforce memorized refusal boundaries for specific groups, challenging current assumptions that scale improves safety. We release all datasets and scripts on GitHub to encourage research into granular demographic alignment.
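The paper's central measurement is the gap in defense (refusal) rates across target groups within a single model. The authors' released scripts are not reproduced here; the following is a minimal sketch, assuming each curated prompt-response pair is labeled with the targeted minority group and a boolean indicating whether the model refused the jailbreak attempt (record fields and group names are illustrative).

```python
# Minimal sketch (not the authors' released scripts): per-group defense rates
# and the within-model disparity that MiJaBench reports.
# Assumes each record carries the targeted group and whether the model refused.
from collections import defaultdict

def defense_rates(records):
    """Return {group: fraction of jailbreak prompts the model refused}."""
    totals, refused = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["group"]] += 1
        refused[rec["group"]] += int(rec["refused"])
    return {g: refused[g] / totals[g] for g in totals}

def disparity(rates):
    """Gap between the best- and worst-protected groups for one model."""
    return max(rates.values()) - min(rates.values())

if __name__ == "__main__":
    # Hypothetical toy records; the real benchmark spans 44,000 prompts
    # across 16 minority groups in two languages.
    toy = [
        {"group": "group_a", "refused": True},
        {"group": "group_a", "refused": True},
        {"group": "group_b", "refused": True},
        {"group": "group_b", "refused": False},
    ]
    rates = defense_rates(toy)
    print(rates)                                  # {'group_a': 1.0, 'group_b': 0.5}
    print(f"disparity: {disparity(rates):.0%}")   # disparity: 50%
```

On this reading, the reported up-to-33% fluctuation corresponds to this per-model disparity value, and the scaling finding would mean the gap widens rather than shrinks as models grow larger.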
Related papers
- Do Large Language Models Reflect Demographic Pluralism in Safety? [12.59854280011403]
Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference.
arXiv Detail & Related papers (2026-02-07T05:40:10Z) - Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs [32.12545369011503]
We introduce DebateBias-8K, a new multilingual, debate-style benchmark to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women's rights, socioeconomic development, terrorism, and religion, across seven languages. Results show that all models reproduce entrenched stereotypes despite safety alignment.
arXiv Detail & Related papers (2025-11-03T03:25:40Z) - Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of an attack from the target LLM. Our approach requires only minimal edits to the original request and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z) - Characterizing Selective Refusal Bias in Large Language Models [10.194832877178701]
Safety guardrails in large language models (LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, LLMs may refuse to generate harmful content targeting some demographic groups and not others. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes.
arXiv Detail & Related papers (2025-10-31T01:17:28Z) - BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models [32.092175234635654]
We present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. We first implement a baseline lingual-backdoor attack, carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. To overcome the task-specificity of this baseline, we design BadLingual, a novel task-agnostic lingual-backdoor attack capable of triggering any downstream task within chat LLMs, regardless of the specific questions of those tasks.
arXiv Detail & Related papers (2025-05-06T13:07:57Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks [90.41592442792181]
We propose SafeDialBench, a fine-grained benchmark for evaluating the safety of Large Language Models (LLMs) in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generate more than 4,000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. Notably, we construct an innovative assessment framework that measures LLMs' capabilities in detecting and handling unsafe information and in maintaining consistency when facing jailbreak attacks.
arXiv Detail & Related papers (2025-02-16T12:08:08Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [28.95203269961824]
We propose a new jailbreak approach, RED QUEEN ATTACK, which constructs a multi-turn scenario that conceals malicious intent under the guise of preventing harm. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, which reaches an 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks.
arXiv Detail & Related papers (2024-09-26T01:24:17Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs') ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - Reducing Target Group Bias in Hate Speech Detectors [56.94616390740415]
We show that text classification models trained on large publicly available datasets may significantly underperform on several protected groups.
We propose to perform token-level hate sense disambiguation and to utilize tokens' hate sense representations for detection.
arXiv Detail & Related papers (2021-12-07T17:49:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.