Do Large Language Models Reflect Demographic Pluralism in Safety?
- URL: http://arxiv.org/abs/2602.07376v1
- Date: Sat, 07 Feb 2026 05:40:10 GMT
- Title: Do Large Language Models Reflect Demographic Pluralism in Safety?
- Authors: Usman Naseem, Gautam Siddharth Kashyap, Sushant Kumar Ray, Rafiq Ali, Ebad Shabbir, Abdullah Mohammad
- Abstract summary: Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference.
- Score: 12.59854280011403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
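A minimal sketch of the SimHash-style near-duplicate filtering the abstract mentions for the expanded prompt pool; the whitespace tokenizer, 64-bit fingerprint, and Hamming-distance cutoff of 3 below are illustrative assumptions, not the paper's reported configuration.

```python
# SimHash-based near-duplicate filtering (illustrative sketch).
# Assumed settings: lowercase whitespace tokens, 64-bit fingerprints,
# and a Hamming-distance cutoff of 3; the paper may use different choices.
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from per-token MD5 hashes."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedup(prompts: list[str], max_dist: int = 3) -> list[str]:
    """Keep a prompt only if no previously kept prompt is within max_dist bits."""
    kept, fingerprints = [], []
    for p in prompts:
        fp = simhash(p)
        if all(hamming(fp, other) > max_dist for other in fingerprints):
            kept.append(p)
            fingerprints.append(fp)
    return kept


# Usage: the case variant collapses onto the first prompt; the unrelated prompt survives.
print(dedup([
    "how do i report online harassment targeting my community",
    "How do I report online harassment targeting my community",
    "what are safe storage rules for prescription medication",
]))
```

In practice, large-scale SimHash deduplication typically indexes fingerprints by bit bands rather than comparing all pairs, but the pairwise version above is enough to show how generated prompts could be pruned before inclusion.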
Related papers
- Are Aligned Large Language Models Still Misaligned? [13.062124372682106]
Mis-Align Bench is a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. SAVACU is an English-aligned dataset of 382,424 misaligned samples spanning 112 domains (or labels).
arXiv Detail & Related papers (2026-02-11T19:30:43Z) - Can Large Language Models Make Everyone Happy? [12.59854280011403]
Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions. We introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling.
arXiv Detail & Related papers (2026-02-11T17:57:23Z) - MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking [0.0]
MiJaBench is an adversarial benchmark comprising 44,000 prompts across 16 minority groups. Defense rates fluctuate by up to 33% within the same model solely based on the target group. We release all datasets and scripts on GitHub to encourage research into granular demographic alignment.
arXiv Detail & Related papers (2026-01-07T20:53:18Z) - Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z) - Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models [0.0]
Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs when exposed to adversarial prompts across text-only and multimodal formats.
arXiv Detail & Related papers (2025-09-18T22:51:06Z) - Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that the demographic context has little effect on free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z) - Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach [31.925448597093407]
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt. PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
arXiv Detail & Related papers (2025-05-24T21:37:10Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) achieve remarkable breakthroughs. Aligning their values with humans has become imperative for their responsible development. There is still a lack of evaluations of LLMs' values that fulfill three desirable goals.
arXiv Detail & Related papers (2025-01-13T05:53:56Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts are often overlooked, such as different languages and dialects, which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)