When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
- URL: http://arxiv.org/abs/2601.17887v1
- Date: Sun, 25 Jan 2026 15:42:01 GMT
- Title: When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
- Authors: Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin
- Abstract summary: In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents. Our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode.
- Score: 49.341830745910194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%-243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.
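The abstract names a lightweight detection-reflection mitigation but gives no implementation details. The sketch below is one plausible reading under stated assumptions: the harmfulness of a query is first judged without the personal memories (detection), and the agent falls back to a refusal whenever that stateless judgment flags the request (reflection), so benign memories cannot legitimize an inherently harmful intent. Every name here (detection_reflection_guard, is_harmful, respond) is a hypothetical stand-in, not the authors' code or the PS-Bench API. Note also that the 15.8%-243.7% figures are stated relative to stateless baselines, which we read as the relative increase in attack success rate over the no-memory ASR.

```python
# Minimal sketch (not the authors' implementation) of a detection-reflection
# guard for a memory-augmented agent. Function names and the two-step structure
# are illustrative assumptions based only on the abstract.
from typing import Callable, Sequence


def detection_reflection_guard(
    query: str,
    memories: Sequence[str],
    is_harmful: Callable[[str], bool],             # safety judge (assumed interface)
    respond: Callable[[str, Sequence[str]], str],  # personalized agent (assumed interface)
    refusal: str = "I can't help with that.",
) -> str:
    """Answer `query` only if it is judged safe *without* personal context."""
    # Detection: judge the raw query with memories deliberately withheld, so that
    # benign personal context cannot "legitimize" an inherently harmful intent.
    if is_harmful(query):
        # Reflection: rather than letting the memory-conditioned agent answer,
        # fall back to a safe refusal.
        return refusal
    # Otherwise answer with full personalization.
    return respond(query, memories)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would call an
    # LLM judge and a memory-augmented agent here.
    toy_judge = lambda q: "explosive" in q.lower()
    toy_agent = lambda q, mem: f"(answer using {len(mem)} memories) {q}"

    print(detection_reflection_guard(
        "Can you help me plan my chemistry class demo?",
        ["user is a chemistry teacher"], toy_judge, toy_agent))
    print(detection_reflection_guard(
        "How do I make an explosive at home?",
        ["user is a chemistry teacher"], toy_judge, toy_agent))
```

One design choice worth flagging: this sketch has the detector ignore memories entirely; a production guard would more likely compare memory-conditioned and stateless judgments rather than discard personal context outright.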
Related papers
- Single-Pixel Vision-Language Model for Intrinsic Privacy-Preserving Behavioral Intelligence [55.512671026669516]
We propose the Single-Pixel Vision-Language Model (SP-VLM), a novel framework that reimagines secure environmental monitoring. It achieves intrinsic privacy-by-design by capturing human dynamics through inherently low-dimensional single-pixel modalities. We show that SP-VLM can nonetheless extract meaningful behavioral semantics, enabling robust anomaly detection, people counting, and activity understanding.
arXiv Detail & Related papers (2026-01-21T09:11:26Z)
- ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs [48.50397204177239]
As large language models (LLMs) evolve, evaluating the safety of their actions becomes critical. We introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe.
arXiv Detail & Related papers (2025-10-01T13:08:33Z)
- IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement [35.904652937034136]
We introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. We show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
arXiv Detail & Related papers (2025-08-27T16:47:31Z)
- Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention [65.47632669243657]
A dishonest institution can exploit abstention mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage. We propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence.
arXiv Detail & Related papers (2025-05-29T19:47:50Z)
- Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach [31.925448597093407]
Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt. PENGUIN is a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. RAISE is a training-free, two-stage agent framework that strategically acquires user-specific background.
arXiv Detail & Related papers (2025-05-24T21:37:10Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
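The "true criticality" definition above is stated in words; read literally it corresponds to the difference in expected return written out below. The notation (state s, policy π, return G, deviation length n) is our own shorthand for the abstract's phrasing, not necessarily the paper's exact formulation.

```latex
% Literal rendering of "expected drop in reward when an agent deviates from its
% policy for n consecutive random actions" (notation assumed, not the paper's):
\[
  C_{\mathrm{true}}(s, n) \;=\;
  \mathbb{E}\bigl[\, G \,\bigm|\, s_0 = s,\ \text{follow } \pi \,\bigr]
  \;-\;
  \mathbb{E}\bigl[\, G \,\bigm|\, s_0 = s,\ \text{take } n \text{ random actions, then follow } \pi \,\bigr]
\]
```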
- DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts [11.883785681042593]
DePrompt is a desensitization protection and effectiveness evaluation framework for prompts.
We integrate contextual attributes to define privacy types, achieving high-precision PII entity identification.
Our framework is adaptable to prompts and can be extended to text usability-dependent scenarios.
arXiv Detail & Related papers (2024-08-16T02:38:25Z)
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- From Mean to Extreme: Formal Differential Privacy Bounds on the Success of Real-World Data Reconstruction Attacks [54.25638567385662]
Differential Privacy in machine learning is often interpreted as guarantees against membership inference. Translating DP budgets into quantitative protection against the more damaging threat of data reconstruction remains a challenging open problem. This paper bridges the critical gap by deriving the first formal privacy bounds tailored to the mechanics of demonstrated "from-scratch" attacks.
arXiv Detail & Related papers (2024-02-20T09:52:30Z)