When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
- URL: http://arxiv.org/abs/2601.17887v1
- Date: Sun, 25 Jan 2026 15:42:01 GMT
- Title: When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
- Authors: Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin
- Abstract summary: In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents. Our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode.
- Score: 49.341830745910194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%-243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.
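The abstract names a lightweight detection-reflection mitigation but gives no implementation details. The sketch below is one plausible reading under stated assumptions: the harmfulness of a query is first judged without the personal memories (detection), and the agent falls back to a refusal whenever that stateless judgment flags the request (reflection), so benign memories cannot legitimize an inherently harmful intent. Every name here (detection_reflection_guard, is_harmful, respond) is a hypothetical stand-in, not the authors' code or the PS-Bench API. Note also that the 15.8%-243.7% figures are stated relative to stateless baselines, which we read as the relative increase in attack success rate over the no-memory ASR.

```python
# Minimal sketch (not the authors' implementation) of a detection-reflection
# guard for a memory-augmented agent. Function names and the two-step structure
# are illustrative assumptions based only on the abstract.
from typing import Callable, Sequence


def detection_reflection_guard(
    query: str,
    memories: Sequence[str],
    is_harmful: Callable[[str], bool],             # safety judge (assumed interface)
    respond: Callable[[str, Sequence[str]], str],  # personalized agent (assumed interface)
    refusal: str = "I can't help with that.",
) -> str:
    """Answer `query` only if it is judged safe *without* personal context."""
    # Detection: judge the raw query with memories deliberately withheld, so that
    # benign personal context cannot "legitimize" an inherently harmful intent.
    if is_harmful(query):
        # Reflection: rather than letting the memory-conditioned agent answer,
        # fall back to a safe refusal.
        return refusal
    # Otherwise answer with full personalization.
    return respond(query, memories)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would call an
    # LLM judge and a memory-augmented agent here.
    toy_judge = lambda q: "explosive" in q.lower()
    toy_agent = lambda q, mem: f"(answer using {len(mem)} memories) {q}"

    print(detection_reflection_guard(
        "Can you help me plan my chemistry class demo?",
        ["user is a chemistry teacher"], toy_judge, toy_agent))
    print(detection_reflection_guard(
        "How do I make an explosive at home?",
        ["user is a chemistry teacher"], toy_judge, toy_agent))
```

One design choice worth flagging: this sketch has the detector ignore memories entirely; a production guard would more likely compare memory-conditioned and stateless judgments rather than discard personal context outright.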
Related papers
- Single-Pixel Vision-Language Model for Intrinsic Privacy-Preserving Behavioral Intelligence [55.512671026669516]
We propose the Single-Pixel Vision-Language Model (SP-VLM), a novel framework that reimagines secure environmental monitoring. It achieves intrinsic privacy-by-design by capturing human dynamics through inherently low-dimensional single-pixel modalities. We show that SP-VLM can nonetheless extract meaningful behavioral semantics, enabling robust anomaly detection, people counting, and activity understanding.
arXiv Detail & Related papers (2026-01-21T09:11:26Z)
- ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs [48.50397204177239]
As large language models (LLMs) evolve, evaluating the safety of their actions becomes critical. We introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe.
arXiv Detail & Related papers (2025-10-01T13:08:33Z)
- IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement [35.904652937034136]
We introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. We show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
arXiv Detail & Related papers (2025-08-27T16:47:31Z)
- Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention [65.47632669243657]
A dishonest institution can exploit abstention mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage. We propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence.
arXiv Detail & Related papers (2025-05-29T19:47:50Z)
- Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach [31.925448597093407]
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt. PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
arXiv Detail & Related papers (2025-05-24T21:37:10Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
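The "true criticality" definition above is stated in words; read literally it corresponds to the difference in expected return written out below. The notation (state s, policy π, return G, deviation length n) is our own shorthand for the abstract's phrasing, not necessarily the paper's exact formulation.

```latex
% Literal rendering of "expected drop in reward when an agent deviates from its
% policy for n consecutive random actions" (notation assumed, not the paper's):
\[
  C_{\mathrm{true}}(s, n) \;=\;
  \mathbb{E}\bigl[\, G \,\bigm|\, s_0 = s,\ \text{follow } \pi \,\bigr]
  \;-\;
  \mathbb{E}\bigl[\, G \,\bigm|\, s_0 = s,\ \text{take } n \text{ random actions, then follow } \pi \,\bigr]
\]
```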
- DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts [11.883785681042593]
DePrompt is a desensitization protection and effectiveness evaluation framework for prompts.
We integrate contextual attributes to define privacy types, achieving high-precision PII entity identification.
Our framework is adaptable to prompts and can be extended to text usability-dependent scenarios.
arXiv Detail & Related papers (2024-08-16T02:38:25Z)
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- From Mean to Extreme: Formal Differential Privacy Bounds on the Success of Real-World Data Reconstruction Attacks [54.25638567385662]
Differential Privacy in machine learning is often interpreted as guarantees against membership inference. Translating DP budgets into quantitative protection against the more damaging threat of data reconstruction remains a challenging open problem. This paper bridges the critical gap by deriving the first formal privacy bounds tailored to the mechanics of demonstrated "from-scratch" attacks.
arXiv Detail & Related papers (2024-02-20T09:52:30Z)