Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
- URL: http://arxiv.org/abs/2603.04191v1
- Date: Wed, 04 Mar 2026 15:42:43 GMT
- Title: Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions
- Authors: Qianyun Guo, Yibo Li, Yue Liu, Bryan Hooi
- Abstract summary: Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions.
- Score: 50.70965714314064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long-horizon interaction histories. It includes three types of test questions (multiple-choice, true-or-false, and open-ended), with detailed rubrics for LLM-as-a-judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user-aware LLM assistants that better adapt to individual needs. The code is available at https://github.com/GG14127/RealPref.
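The abstract's "detailed rubrics for LLM-as-a-judge evaluation" can be sketched as a weighted rubric-scoring step; the `RubricItem` and `score_response` names and the weighted-aggregation scheme below are illustrative assumptions, not taken from the RealPref codebase:

```python
# Hypothetical sketch of rubric-based LLM-as-a-judge scoring for open-ended
# answers. A judge LLM would emit one pass/fail verdict per rubric criterion;
# here those verdicts are aggregated into a weighted score in [0, 1].
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str   # e.g. "response respects the user's stated dietary preference"
    weight: float    # relative importance of this criterion


@dataclass
class Rubric:
    items: list      # list of RubricItem


def score_response(rubric, judgments):
    """Aggregate per-criterion pass/fail judgments into a weighted score.

    `judgments` is a list of booleans aligned with `rubric.items`.
    Returns 0.0 for an empty rubric.
    """
    total = sum(item.weight for item in rubric.items)
    earned = sum(item.weight for item, ok in zip(rubric.items, judgments) if ok)
    return earned / total if total else 0.0
```

With multiple-choice and true-or-false questions the check reduces to exact-match accuracy; a weighted rubric like this is one plausible way to make open-ended grading reproducible across judge models.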
Related papers
- ALPBench: A Benchmark for Attribution-level Long-term Personal Behavior Understanding [53.88804678012327]
ALPBench is a benchmark for attribution-level long-term personal behavior understanding. It predicts user-interested attribute combinations, enabling ground-truth evaluation. It models preferences from long-term historical behaviors rather than users' explicitly expressed requests.
arXiv Detail & Related papers (2026-02-03T03:32:16Z) - Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction [40.857161437572465]
We introduce a benchmark for evaluating latent information discovery in personalized interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context.
arXiv Detail & Related papers (2025-10-20T03:58:49Z) - MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces [97.62557395494962]
We use crowdsourcing to benchmark GPT-4o, Claude, and Llama across 30 interfaces. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others.
arXiv Detail & Related papers (2025-10-09T20:00:41Z) - CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions [39.554239954719876]
CUPID is a benchmark of 756 human-curated interaction session histories. We evaluate 10 open and proprietary Large Language Models (LLMs). Our work highlights the need to advance LLM capabilities for more contextually personalized interactions.
arXiv Detail & Related papers (2025-08-03T09:04:48Z) - Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale [53.059480071818136]
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories. We evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile.
arXiv Detail & Related papers (2025-04-19T08:16:10Z) - Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - PersonalLLM: Tailoring LLMs to Individual Preferences [11.717169516971856]
We present a public benchmark, PersonalLLM, focusing on adapting LLMs to provide maximal benefits for a particular user. We curate open-ended prompts paired with many high-quality answers over which users would be expected to display heterogeneous latent preferences. Our dataset and generated personalities offer an innovative testbed for developing personalization algorithms.
arXiv Detail & Related papers (2024-09-30T13:55:42Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
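As a rough illustration of the contrastive idea behind RPO (not the paper's exact formulation), a DPO-style pairwise logistic loss can be reweighted by prompt similarity, so that contrasts between responses to related prompts count in proportion to how related the prompts are; `rpo_pairwise_loss` and its argument names are hypothetical for this sketch:

```python
import math


def rpo_pairwise_loss(reward_preferred, reward_dispreferred, prompt_sim, beta=0.1):
    """Sketch of a similarity-weighted contrastive preference loss.

    reward_preferred / reward_dispreferred: scalar (implicit) rewards of the
    two responses being contrasted. prompt_sim in [0, 1] measures how related
    their source prompts are (1.0 when both come from the identical prompt).
    """
    margin = beta * (reward_preferred - reward_dispreferred)
    # Standard logistic (DPO-style) loss on the reward margin,
    # scaled down for contrasts drawn from less-related prompts.
    return -prompt_sim * math.log(1.0 / (1.0 + math.exp(-margin)))
```

With `prompt_sim = 1.0` this reduces to the usual pairwise logistic loss over responses to the same prompt; contrasts across unrelated prompts (`prompt_sim` near 0) contribute little to training.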
This list is automatically generated from the titles and abstracts of the papers on this site.