How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants
- URL: http://arxiv.org/abs/2601.16621v1
- Date: Fri, 23 Jan 2026 10:19:48 GMT
- Title: How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants
- Authors: Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu,
- Abstract summary: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. We develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. We introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information.
- Score: 25.83552447206606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.
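The abstract's core idea, injecting a stored preference memory only when it is relevant to the user's current intent, can be caricatured as a gating step before prompt construction. A minimal sketch (all names and the lexical-overlap scorer are illustrative assumptions; the paper's RP-Reasoner performs LLM-based pragmatic reasoning, not word overlap):

```python
# Hypothetical sketch: gate stored preference memories on query relevance
# before injecting them into the assistant's context, so irrelevant
# memories do not interfere with intent understanding.

def relevance(memory: str, query: str) -> float:
    """Crude lexical-overlap proxy for memory-query relevance."""
    m, q = set(memory.lower().split()), set(query.lower().split())
    return len(m & q) / len(q) if q else 0.0

def select_memories(memories: list[str], query: str,
                    threshold: float = 0.2) -> list[str]:
    """Keep only memories judged relevant to the current request."""
    return [m for m in memories if relevance(m, query) >= threshold]

memories = [
    "user prefers vegetarian recipes",
    "user is learning Spanish",
]
query = "suggest a vegetarian dinner recipe"
context = select_memories(memories, query)
# The dietary preference passes the gate; the language-learning memory
# is excluded as irrelevant to this request.
```

The point of the sketch is the architecture, a selection step between memory retrieval and prompt assembly, rather than the scoring function itself.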
Related papers
- Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions [50.70965714314064]
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions.
arXiv Detail & Related papers (2026-03-04T15:42:43Z) - Learning Personalized Agents from Human Feedback [36.47803872623135]
We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization. PAHF learns online from live interaction using explicit per-user memory. The benchmarks quantify an agent's ability to learn initial preferences from scratch and subsequently adapt to persona shifts.
arXiv Detail & Related papers (2026-02-18T04:18:47Z) - Synthetic Interaction Data for Scalable Personalization in Large Language Models [67.31884245564086]
We introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process. We release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories.
arXiv Detail & Related papers (2026-02-12T20:41:22Z) - ALPBench: A Benchmark for Attribution-level Long-term Personal Behavior Understanding [53.88804678012327]
ALPBench is a benchmark for attribution-level long-term personal behavior understanding. It predicts user-interested attribute combinations, enabling ground-truth evaluation. It models preferences from long-term historical behaviors rather than users' explicitly expressed requests.
arXiv Detail & Related papers (2026-02-03T03:32:16Z) - OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents [55.27061195244624]
We formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy. Agents tend to retrieve and over-attend to user memories even when unnecessary. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.
arXiv Detail & Related papers (2026-01-20T08:27:13Z) - When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs [13.695058536403108]
We show that when personalized large language models (LLMs) face factual queries, they generate answers aligned with a user's prior history rather than the objective truth. We propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions.
arXiv Detail & Related papers (2026-01-16T05:20:10Z) - PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory [56.81126490418336]
Personalization is one of the next milestones in advancing AI capability and alignment. PersonaMem-v2 simulates 1,000 realistic user-chatbot interactions across 300+ scenarios, 20,000+ user preferences, and 128k-token context windows. We train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization.
arXiv Detail & Related papers (2025-12-07T06:48:23Z) - Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It [81.50711040539566]
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks. Our framework creates scenarios where identical questions require different reasoning chains depending on user context.
arXiv Detail & Related papers (2025-09-30T18:55:28Z) - Addressing Personalized Bias for Unbiased Learning to Rank [56.663619153713434]
Unbiased learning to rank (ULTR) aims to learn unbiased ranking models from biased user behavior logs. We propose a novel user-aware inverse-propensity-score estimator for learning-to-rank objectives.
arXiv Detail & Related papers (2025-08-28T14:01:31Z) - Explicit v.s. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information [13.292751023556221]
In large language model-based agents, memory serves as a critical capability for achieving personalization by storing and utilizing users' information. We propose the multi-hop personalized reasoning task to explore how different memory mechanisms perform in multi-hop reasoning over personalized information.
arXiv Detail & Related papers (2025-08-18T13:34:37Z) - PRIME: Large Language Model Personalization with Cognitive Dual-Memory and Personalized Thought Process [6.27203886967197]
Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. We introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. Experiments validate PRIME's effectiveness across both long- and short-context scenarios.
arXiv Detail & Related papers (2025-07-07T01:54:34Z) - NextQuill: Causal Preference Modeling for Enhancing LLM Personalization [82.15961484963256]
We introduce NextQuill, a novel personalization framework grounded in causal preference modeling. NextQuill introduces two complementary alignment strategies. Experiments across multiple personalization benchmarks demonstrate that NextQuill significantly improves personalization quality.
arXiv Detail & Related papers (2025-06-03T02:08:55Z) - Personalized Language Models via Privacy-Preserving Evolutionary Model Merging [53.97323896430374]
Personalization in language models aims to tailor model behavior to individual users or user groups. We propose Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME). PriME employs gradient-free methods to directly optimize utility while reducing privacy risks. Experiments on the LaMP benchmark show that PriME consistently outperforms a range of baselines, achieving up to a 45% improvement in task performance.
arXiv Detail & Related papers (2025-03-23T09:46:07Z) - Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning [36.88126051792774]
Personalization in large language models (LLMs) is increasingly important. One PEFT Per User (OPPU) employs personalized parameter-efficient fine-tuning (PEFT) modules to store user-specific behavior patterns and preferences. OPPU significantly outperforms existing prompt-based methods across seven diverse tasks in the LaMP benchmark.
arXiv Detail & Related papers (2024-02-06T21:03:52Z) - PeTra: A Sparsely Supervised Memory Model for People Tracking [50.98911178059019]
We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots.
We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while retaining strong performance.
PeTra is highly effective in both evaluations, demonstrating its ability to track people in its memory despite being trained with limited annotation.
arXiv Detail & Related papers (2020-05-06T17:45:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.