SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models
- URL: http://arxiv.org/abs/2601.21235v1
- Date: Thu, 29 Jan 2026 03:54:25 GMT
- Title: SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models
- Authors: Alok Abhishek, Tushar Bandopadhyay, Lisa Erickson
- Abstract summary: This paper introduces Social Harm Analysis via Risk Profiles, a framework for multidimensional, distribution-aware evaluation of social harm. It shows that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility.
- Score: 0.5599792629509229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value-at-Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Marginal tail behavior also varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
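To make the abstract's two key constructions concrete, here is a minimal sketch, not the paper's implementation, of the union-of-failures aggregation (reparameterized as additive cumulative log-risk) and of empirical CVaR95. The independence assumption behind the product form, the function names, and the Beta-distributed toy scores are all illustrative assumptions.

```python
import numpy as np

def union_log_risk(p):
    """Union-of-failures aggregation: under independence,
    P(any harm) = 1 - prod_d (1 - p_d); taking -log(1 - .) turns the
    product into an additive cumulative log-risk across dimensions."""
    return -np.sum(np.log1p(-np.asarray(p)), axis=-1)

def cvar(x, alpha=0.95):
    """Empirical CVaR_alpha: mean of the scores at or above the
    alpha-quantile (the VaR), i.e. the average severity of the worst tail."""
    x = np.asarray(x)
    return x[x >= np.quantile(x, alpha)].mean()

# Toy data: per-prompt harm probabilities on the four SHARP dimensions
# (bias, fairness, ethics, epistemic reliability), shape (n_prompts, 4).
rng = np.random.default_rng(0)
harm = rng.beta(1.0, 20.0, size=(901, 4))   # mostly small, occasionally large
risk = union_log_risk(harm)                 # one aggregate score per prompt
print(f"mean risk = {risk.mean():.4f}  CVaR95 = {cvar(risk):.4f}")
```

Two models can match on `risk.mean()` while differing sharply in `cvar(risk)`; that gap is the twofold difference in tail exposure the abstract reports.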
Related papers
- On the Generalization and Robustness in Conditional Value-at-Risk [12.253712889424584]
We develop a learning-theoretic analysis of Conditional Value-at-Risk (CVaR)-based empirical risk minimization under heavy-tailed and contaminated data. We establish sharp, high-probability generalization and excess risk bounds under minimal moment assumptions. We show that CVaR decisions themselves can be intrinsically unstable under heavy tails.
arXiv Detail & Related papers (2026-02-20T08:10:11Z)
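For context on the entry above: CVaR-based empirical risk minimization is commonly built on the Rockafellar-Uryasev variational form, CVaR_alpha(L) = min_t { t + E[(L - t)_+] / (1 - alpha) }. The sketch below uses this standard plug-in estimator on illustrative heavy-tailed data; whether the paper analyzes exactly this form is an assumption.

```python
import numpy as np

def ru_objective(t, losses, alpha=0.95):
    """Rockafellar-Uryasev objective: minimizing over t (jointly with model
    parameters, in full CVaR-ERM) yields CVaR_alpha of the loss."""
    return t + np.mean(np.maximum(losses - t, 0.0)) / (1.0 - alpha)

# The inner minimizer over t is the empirical alpha-quantile (the VaR),
# so evaluating there gives the plug-in CVaR estimate.
losses = np.random.default_rng(1).pareto(3.0, size=5_000)  # heavy-tailed toy losses
t_star = np.quantile(losses, 0.95)
print(f"empirical CVaR95 = {ru_objective(t_star, losses):.3f}")
```

Rerunning with different seeds shows the estimate fluctuating far more than the mean loss does, a small-scale hint of the heavy-tail instability the paper studies.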
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z)
- Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling [50.872910438715486]
Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling.
arXiv Detail & Related papers (2026-01-30T06:54:35Z)
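SABER's scaling-aware estimator is not reproduced in this summary; the sketch below shows only the baseline Best-of-N scaling law such an analysis starts from, with hypothetical trial counts. Assuming an independent per-sample jailbreak probability p, the risk under Best-of-N sampling is 1 - (1 - p)^N.

```python
def best_of_n_risk(p_single: float, n: int) -> float:
    """P(at least one jailbreak in n independent samples) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_single) ** n

# Hypothetical single-shot estimate: 3 jailbreaks observed in 200 trials.
p_hat = 3 / 200
for n in (1, 10, 100, 1_000):
    print(f"N={n:>5}: risk = {best_of_n_risk(p_hat, n):.3f}")
```

A per-sample rate of 1.5% that looks tolerable single-shot already implies roughly 78% risk at N=100 under this idealized independence assumption, which is why low-budget safety evaluation understates adversarial exposure.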
- The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents [37.75212140218036]
We formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS, a scenario-driven framework for systematically assessing this risk. We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models.
arXiv Detail & Related papers (2026-01-24T07:09:50Z)
- RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration [81.38705556267917]
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations. We introduce a theoretical framework that reconstructs the underlying risk concept space. We propose RADAR, a multi-agent collaborative evaluation framework.
arXiv Detail & Related papers (2025-09-28T09:35:32Z)
- Exploring the Secondary Risks of Large Language Models [26.00748215572094]
We introduce secondary risks, marked by harmful or misleading behaviors that arise in response to benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. We propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors.
arXiv Detail & Related papers (2025-06-14T07:31:52Z)
- Conformal Tail Risk Control for Large Language Model Alignment [9.69785515652571]
General-purpose scoring models have been created to automate the process of quantifying tail events. This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms. We present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees.
arXiv Detail & Related papers (2025-02-27T17:10:54Z)
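The summary above does not specify the calibration procedure, so the following is a generic split-conformal sketch of the same idea: choose a score threshold from human-labeled calibration examples such that fresh human-flagged tail events are caught with a provable marginal guarantee. The function name, alpha, and data are assumptions for illustration.

```python
import numpy as np

def conformal_tail_threshold(cal_scores, alpha=0.05):
    """Split-conformal lower quantile: a fresh exchangeable score lands at or
    above the returned threshold with probability >= 1 - alpha. Needs
    n >= 1/alpha - 1 calibration points for a finite threshold."""
    s = np.sort(np.asarray(cal_scores))
    k = int(np.floor(alpha * (len(s) + 1)))   # rank of the conformal quantile
    return s[k - 1] if k >= 1 else -np.inf    # k-th smallest, 1-indexed

# Hypothetical usage: black-box scores on 500 examples that human annotators
# labeled as genuine tail events; deploy the threshold as a flagging rule.
rng = np.random.default_rng(2)
cal = rng.normal(0.8, 0.1, size=500)
tau = conformal_tail_threshold(cal, alpha=0.05)
print(f"flag outputs scoring >= {tau:.3f}")   # misses <= 5% of such events
```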
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors [10.857775300638831]
We explore prediction risk as well as estimation risk under more general regression error assumptions. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data.
arXiv Detail & Related papers (2023-05-22T10:04:20Z)
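As background for the entry above (not the paper's code): the ridgeless estimator is the minimum-l2-norm interpolator, the limit of ridge regression as the penalty vanishes, and in the overparameterized regime it can be computed with a pseudoinverse. A toy sketch with i.i.d. errors:

```python
import numpy as np

# Ridgeless least squares = minimum-l2-norm interpolator, i.e. the limit of
# the ridge solution (X'X + lam*I)^{-1} X'y as lam -> 0, given by the
# Moore-Penrose pseudoinverse.
rng = np.random.default_rng(3)
n, p = 50, 200                                    # overparameterized: p > n
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)  # toy i.i.d. noise
beta_hat = np.linalg.pinv(X) @ y                  # min-norm interpolating fit
print(np.allclose(X @ beta_hat, y))               # True: zero training error
```

The paper's question is how the prediction and estimation risk of this interpolator behave when the i.i.d. error assumption is relaxed, e.g. to time series or grouped errors.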
- Mitigating multiple descents: A model-agnostic framework for risk monotonization [84.6382406922369]
We develop a general framework for risk monotonization based on cross-validation.
We propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting.
arXiv Detail & Related papers (2022-05-25T17:41:40Z)
- A General Framework for Survival Analysis and Multi-State Modelling [70.31153478610229]
We use neural ordinary differential equations as a flexible and general method for estimating multi-state survival models.
We show that our model exhibits state-of-the-art performance on popular survival data sets and demonstrate its efficacy in a multi-state setting.
arXiv Detail & Related papers (2020-06-08T19:24:54Z)