Challenges of Evaluating LLM Safety for User Welfare
- URL: http://arxiv.org/abs/2512.10687v1
- Date: Thu, 11 Dec 2025 14:34:40 GMT
- Title: Challenges of Evaluating LLM Safety for User Welfare
- Authors: Manon Kempermann, Sai Suresh Macharla Vasu, Mahalakshmi Raveenthiran, Theo Farrell, Ingmar Weber,
- Abstract summary: We argue that developing user-welfare safety evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design.<n>We evaluate advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability.<n>Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles.
- Score: 0.3749446315124487
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we rerun the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.
Related papers
- SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end.<n>We pioneer the incorporation of risk categories and levels into input design.<n>We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z) - The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [57.1838332916627]
Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP)<n>Their widespread deployment has also raised significant safety concerns.<n>LLMs-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts.
arXiv Detail & Related papers (2025-06-06T05:50:50Z) - Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach [31.925448597093407]
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt.<n> PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants.<n> RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
arXiv Detail & Related papers (2025-05-24T21:37:10Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.<n>The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants [5.7605009639020315]
Assessment of ten leading models across five scenarios (with 337 use cases each)<n>We find that top-rated "harmless" models make recommendations that should be recognised as obviously harmful to the user given the context provided.<n>Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising desires above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
arXiv Detail & Related papers (2024-10-28T15:59:31Z) - SAFETY-J: Evaluating Safety with Critique [24.723999605458832]
We introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment.
We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention.
Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios.
arXiv Detail & Related papers (2024-07-24T08:04:00Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z) - Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI [76.28956947107372]
Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and are challenging to detect as harmful.
We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety.
Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, showing absolute improvement in safety classification accuracy by 5.9%.
arXiv Detail & Related papers (2022-12-19T17:51:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.