PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
- URL: http://arxiv.org/abs/2601.08951v1
- Date: Tue, 13 Jan 2026 19:41:11 GMT
- Title: PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
- Authors: Jing-Jing Li, Joel Mire, Eve Fleisig, Valentina Pyatkin, Anne Collins, Maarten Sap, Sydney Levine,
- Abstract summary: Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans disagree. We introduce PluriHarms, a benchmark designed to study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data.
- Score: 39.043933213898136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
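To make the two axes concrete, the sketch below shows one plausible way to score each prompt from raw annotator ratings: the harm axis as the mean rating and the agreement axis as rating dispersion. The column layout, the 1-5 rating scale, and the use of the standard deviation as the disagreement measure are illustrative assumptions, not the paper's actual formulation.
```python
# Illustrative sketch only: PluriHarms does not necessarily use these
# exact field names, scales, or statistics.
from collections import defaultdict
import statistics

# ratings: (prompt_id, annotator_id, harm_rating) tuples, with harm_rating
# on an assumed 1 (benign) .. 5 (harmful) scale.
ratings = [
    ("p01", "a01", 1), ("p01", "a02", 2), ("p01", "a03", 1),
    ("p02", "a01", 1), ("p02", "a02", 5), ("p02", "a03", 4),
]

by_prompt = defaultdict(list)
for prompt_id, _annotator_id, harm in ratings:
    by_prompt[prompt_id].append(harm)

for prompt_id, scores in by_prompt.items():
    harm_axis = statistics.mean(scores)         # benign -> harmful
    agreement_axis = statistics.pstdev(scores)  # agreement -> disagreement
    print(f"{prompt_id}: harm={harm_axis:.2f}, disagreement={agreement_axis:.2f}")
```
Under this reading, a prompt like p02 above scores as a borderline, high-disagreement case, which is exactly the region of the space the benchmark is built to cover.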
Related papers
- In Quest of an Extensible Multi-Level Harm Taxonomy for Adversarial AI: Heart of Security, Ethical Risk Scoring and Resilience Analytics [0.0]
Harm is invoked everywhere from cybersecurity, ethics, and risk analysis to adversarial AI. Current discourse relies on vague, under-specified notions of harm, rendering nuanced, structured, and qualitative assessment effectively impossible. We introduce a structured and expandable taxonomy of harms, grounded in an ensemble of contemporary ethical theories.
arXiv Detail & Related papers (2026-01-23T17:44:05Z)
- Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast [0.0]
This paper analyzes contemporary debates about AI risk. We find that differences in perspectives about existential risk ("X-risk") arise from differences in causal premises about design vs. emergence in complex systems. Disagreements about these two forms of AI risk appear to share two properties: neither involves significant disagreements on moral values, and both can be described in terms of differing views on the extent of boundedness of human rationality.
arXiv Detail & Related papers (2025-12-06T08:48:30Z)
- EmoRAG: Evaluating RAG Robustness to Symbolic Perturbations [57.97838850473147]
Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI. Our study unveils a critical, overlooked vulnerability: their susceptibility to subtle symbolic perturbations. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts.
arXiv Detail & Related papers (2025-12-01T06:53:49Z)
- Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments [4.547649832854566]
Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones, eliciting harmful, restricted behavior despite alignment safeguards.
arXiv Detail & Related papers (2025-11-16T15:16:33Z)
- AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework [4.84912384919978]
Existing AI risk assessment models focus on internal compliance, often neglecting diverse stakeholder perspectives and real-world consequences. We propose a paradigm shift to a human-centric, harm-severity-adaptive approach grounded in empirical incident data. We present AI Harmonics, which includes a novel AI harm assessment metric (AIH) that leverages ordinal severity data to capture relative impact without requiring precise numerical estimates.
arXiv Detail & Related papers (2025-09-12T09:52:45Z)
- ANNIE: Be Careful of Your Robots [48.89876809734855]
We present the first systematic study of adversarial safety attacks on embodied AI systems. We show attack success rates exceeding 50% across all safety categories. Results expose a previously underexplored but highly consequential attack surface in embodied AI systems.
arXiv Detail & Related papers (2025-09-03T15:00:28Z)
- Rethinking How AI Embeds and Adapts to Human Values: Challenges and Opportunities [0.6113558800822273]
We argue that AI systems should implement long-term reasoning and remain adaptable to evolving values. Value alignment requires more theories to address the full spectrum of human values. We identify the challenges associated with value alignment and indicate directions for advancing value alignment research.
arXiv Detail & Related papers (2025-08-23T18:19:05Z)
- Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity [27.898678946802438]
We introduce a novel data-driven approach for interpreting granular ratings in pluralistic datasets. We distill non-parametric responsiveness metrics that quantify how consistently raters score varying severity levels of safety violations. We show that our approach can inform rater selection and feedback interpretation by capturing nuanced viewpoints across different demographic groups.
arXiv Detail & Related papers (2025-03-07T17:32:31Z)
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question. We highlight some key research problems in superalignment, namely weak-to-strong generalization, scalable oversight, and evaluation. We present a conceptual framework for superalignment consisting of three modules: an attacker that generates adversarial queries to expose the weaknesses of a learner model; a learner that refines itself by learning from scalable feedback generated by a critic model together with minimal input from human experts; and a critic that generates critiques or explanations for a given query-response pair, with the goal of improving the learner through criticism.
arXiv Detail & Related papers (2024-12-15T10:34:06Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)