Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
- URL: http://arxiv.org/abs/2505.18882v2
- Date: Thu, 29 May 2025 23:48:38 GMT
- Title: Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
- Authors: Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang
- Abstract summary: Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt. PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
- Score: 17.5700128005813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics - such as factuality, bias, or toxicity - overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce personalized safety to fill this gap and present PENGUIN - a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE - a training-free, two-stage agent framework that strategically acquires user-specific background. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical solution for personalizing LLM responses without model retraining. This work establishes a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.
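To make the two-stage design concrete, below is a minimal Python sketch of a RAISE-style agent. It is only an illustration under stated assumptions, not the paper's implementation: `llm` is any callable mapping a prompt string to a completion string, `ask_user` collects one answer from the user, and the candidate attributes and the query budget are hypothetical placeholders rather than values taken from the paper.

```python
# Minimal sketch of a training-free, two-stage personalization agent in the
# spirit of RAISE. Assumptions (not from the paper): `llm` is any callable that
# maps a prompt string to a completion string; `ask_user` returns the user's
# answer to one question; attribute names and the budget are illustrative.
from typing import Callable, Dict, List


def acquire_context(llm: Callable[[str], str],
                    user_query: str,
                    candidate_attributes: List[str],
                    ask_user: Callable[[str], str],
                    budget: int = 3) -> Dict[str, str]:
    """Stage 1: ask only the attributes most relevant to responding safely."""
    ranking_prompt = (
        "Given the user query below, rank these background attributes by how "
        f"much they change the safety of a response: {candidate_attributes}\n"
        f"Query: {user_query}\n"
        "Return a comma-separated list, most relevant first."
    )
    ranked = [a.strip() for a in llm(ranking_prompt).split(",") if a.strip()]
    context: Dict[str, str] = {}
    for attribute in ranked[:budget]:  # stay within the interaction budget
        context[attribute] = ask_user(f"Could you share your {attribute}?")
    return context


def respond_with_context(llm: Callable[[str], str],
                         user_query: str,
                         context: Dict[str, str]) -> str:
    """Stage 2: generate a response conditioned on the acquired background."""
    profile = "; ".join(f"{k}: {v}" for k, v in context.items())
    return llm(
        "Answer the query safely, taking the user's background into account.\n"
        f"Background: {profile}\nQuery: {user_query}"
    )
```

In this sketch, stage 1 reflects the paper's observation that only a few context attributes are worth asking for (the reported average interaction cost is 2.7 user queries), while stage 2 conditions the final answer on whatever background was collected.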
Related papers
- Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix [0.0]
Aymara AI is a programmatic platform for generating and administering customized, policy-grounded safety evaluations. It transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.
arXiv Detail & Related papers (2025-07-19T18:49:16Z)
- SafeLawBench: Towards Safe Alignment of Large Language Models [18.035407356604832]
There is a lack of definitive standards for evaluating the safety of large language models (LLMs). SafeLawBench categorizes safety risks into three levels based on legal standards. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks.
arXiv Detail & Related papers (2025-06-07T03:09:59Z)
- Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications. We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z)
- Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models [26.667869862556973]
We introduce U-SAFEBENCH, the first benchmark to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards. We propose a simple remedy based on chain-of-thought prompting, demonstrating its effectiveness in improving user-specific safety.
arXiv Detail & Related papers (2025-02-20T22:58:44Z)
- Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of large language model (LLM) agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
- Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [28.0550468465181]
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications.
This work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments.
We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records.
arXiv Detail & Related papers (2024-01-18T14:40:46Z)
- A Security Risk Taxonomy for Prompt-Based Interaction With Large Language Models [5.077431021127288]
This paper addresses a gap in current research by focusing on security risks posed by large language models (LLMs).
Our work proposes a taxonomy of security risks along the user-model communication pipeline and categorizes the attacks by target and attack type alongside the commonly used confidentiality, integrity, and availability (CIA) triad.
arXiv Detail & Related papers (2023-11-19T20:22:05Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.