Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
- URL: http://arxiv.org/abs/2504.14225v1
- Date: Sat, 19 Apr 2025 08:16:10 GMT
- Title: Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
- Authors: Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
- Abstract summary: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories. We evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile.
- Score: 51.9706400130481
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain about how well today's LLMs can leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e., a query issued by the user from a first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution of users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, and Gemini-2.0 achieving only around 50% overall accuracy, suggesting substantial room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
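As a rough illustration of the evaluation protocol, a PERSONAMEM-style harness might look like the sketch below, assuming a multiple-choice setup in which the model must pick the response candidate consistent with the user's current profile state. The `BenchmarkItem` layout and the `ask()` callable are illustrative assumptions, not the benchmark's actual interface.

```python
# Minimal sketch of a PERSONAMEM-style evaluation loop. The item layout
# and the ask() callable are illustrative assumptions, not the
# benchmark's actual interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    history: str           # concatenated multi-turn interaction history
    query: str             # in-situ, first-person user query
    candidates: list[str]  # candidate responses for different profile states
    answer_idx: int        # candidate matching the user's *current* profile

def evaluate(items: list[BenchmarkItem], ask: Callable[[str], str]) -> float:
    """Fraction of items where the model picks the profile-consistent response."""
    correct = 0
    for item in items:
        options = "\n".join(f"({i}) {c}" for i, c in enumerate(item.candidates))
        prompt = (
            f"Conversation history:\n{item.history}\n\n"
            f"User query: {item.query}\n\n"
            "Pick the response that best fits the user's current preferences. "
            f"Answer with the option number only.\n{options}"
        )
        reply = ask(prompt).strip()
        if reply.strip("()") == str(item.answer_idx):
            correct += 1
    return correct / len(items)
```

Under this kind of protocol, the reported roughly 50% accuracy of frontier models corresponds to `evaluate` returning about 0.5.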
Related papers
- UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [39.79275025010785]
UQABench is a benchmark designed to evaluate the effectiveness of user embeddings in prompting large language models for personalization. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings.
arXiv Detail & Related papers (2025-02-26T14:34:00Z)
- Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models [11.708400514900053]
We employ two proprietary Large Language Models (LLMs) to generate a heterogeneous base of user profiles.
We perform a detailed analysis of the user profiles generated by LLMs to assess the diversity, consistency, and potential biases inherent in these simulations.
We find that GPT-o1 generates a more heterogeneous user distribution across most user attributes, while GPT-4o generates more skewed ones.
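A toy rendering of that simulate-then-analyze recipe, with a simple skew statistic of the kind such an analysis needs; the attribute schema, prompt wording, and `ask()` helper are hypothetical stand-ins for the paper's setup.

```python
# Toy sketch of LLM-driven user-profile simulation plus a simple skew
# check; the attribute schema and prompt are hypothetical stand-ins.
import json
from collections import Counter
from typing import Callable

ATTRIBUTES = ["age_group", "occupation", "tech_savviness", "patience"]

def generate_profiles(ask: Callable[[str], str], n: int) -> list[dict]:
    prompt = (
        "Return one user profile as a JSON object with exactly these keys: "
        + ", ".join(ATTRIBUTES) + ". Vary the values across calls."
    )
    return [json.loads(ask(prompt)) for _ in range(n)]

def attribute_skew(profiles: list[dict], attr: str) -> float:
    """Share of profiles holding the most common value; 1.0 means fully skewed."""
    counts = Counter(p[attr] for p in profiles)
    return max(counts.values()) / len(profiles)
```

Comparing `attribute_skew` across generator models is the sort of check that separates a heterogeneous distribution from a skewed one.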
arXiv Detail & Related papers (2025-02-18T12:20:16Z)
- Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text [59.68239795065175]
We conduct a user study where users are shown a question and asked what they would prefer to see.
We use the data to establish that a user's personal traits do influence the data outputs they prefer.
arXiv Detail & Related papers (2024-11-12T00:24:31Z)
- RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models [53.304699445700926]
We introduce the Retrieval-Augmented Personalization (RAP) framework for personalizing MLLMs.
Starting from a general MLLM, we turn it into a personalized assistant in three steps.
By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning.
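The retrieve-then-prompt step at the heart of such a pipeline could be sketched as follows; the concept-store layout, the `embed()` helper, and cosine ranking are assumptions standing in for RAP's actual retriever.

```python
# Rough sketch of a retrieval-augmented personalization step in the
# spirit of RAP; the concept store and embed() helper are assumptions,
# not the paper's actual components.
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def personalize_prompt(
    query: str,
    concept_store: list[dict],           # {"name", "description", "vec"} per concept
    embed: Callable[[str], list[float]],
    top_k: int = 2,
) -> str:
    """Prepend the user's most relevant personal concepts to the query."""
    qv = embed(query)
    ranked = sorted(concept_store, key=lambda c: cosine(qv, c["vec"]), reverse=True)
    context = "\n".join(f"- {c['name']}: {c['description']}" for c in ranked[:top_k])
    return f"Known user-specific concepts:\n{context}\n\nUser: {query}"
```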
arXiv Detail & Related papers (2024-10-17T09:10:26Z)
- Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z)
- PersonalLLM: Tailoring LLMs to Individual Preferences [11.717169516971856]
We present a public benchmark, PersonalLLM, focusing on adapting LLMs to provide maximal benefits for a particular user. We curate open-ended prompts paired with many high-quality answers over which users would be expected to display heterogeneous latent preferences. Our dataset and generated personalities offer an innovative testbed for developing personalization algorithms.
arXiv Detail & Related papers (2024-09-30T13:55:42Z)
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and to provide instant feedback.
We build a dataset consisting of alternating time slices of queries and responses, covering typical feedback types in instantaneous interactions.
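One way to picture the time-slice idea is to chunk generation so the system can poll for user input between chunks; `generate_chunk()` and `poll_user_input()` below are placeholder callables, not the paper's decoding mechanism.

```python
# Toy sketch of time-sliced duplex interaction: generation is chunked so
# the system can check for (and react to) user input between chunks.
from typing import Callable, Iterator, Optional

def duplex_loop(
    generate_chunk: Callable[[], Optional[str]],   # None when the response is done
    poll_user_input: Callable[[], Optional[str]],  # non-blocking; None if silent
    max_slices: int = 50,
) -> Iterator[tuple[str, str]]:
    for _ in range(max_slices):
        chunk = generate_chunk()
        if chunk is None:
            break
        yield ("assistant", chunk)
        feedback = poll_user_input()
        if feedback is not None:
            yield ("user", feedback)  # instant feedback lands mid-response
```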
arXiv Detail & Related papers (2024-06-22T03:20:10Z)
- How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation [14.646529557978512]
We analyze the limitations of using Large Language Models to construct user simulators for Conversational Recommender Systems.
Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results.
We propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items.
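A minimal rendering of such a strategy, under the assumption that the simulator reveals target-item attributes one at a time and never the item name itself, which is what the leakage finding above cautions against:

```python
# Toy SimpleUserSim-style turn generator: steer the topic via attributes
# of the target item without naming the item (to avoid leakage). The
# attribute schema is an illustrative assumption.

def simulate_user_turn(target_item: dict, mentioned: set) -> str:
    for attr, value in target_item["attributes"].items():
        if attr not in mentioned:
            mentioned.add(attr)
            return f"I'm looking for something whose {attr} is {value}."
    return "Any of those works; please just recommend one."

target = {"title": "Inception",
          "attributes": {"genre": "sci-fi", "director": "Christopher Nolan"}}
seen: set = set()
print(simulate_user_turn(target, seen))  # reveals the genre
print(simulate_user_turn(target, seen))  # reveals the director
print(simulate_user_turn(target, seen))  # attributes exhausted; asks to pick
```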
arXiv Detail & Related papers (2024-03-25T04:21:06Z)
- User-LLM: Efficient LLM Contextualization with User Embeddings [23.226164112909643]
User-LLM is a novel framework that leverages user embeddings to directly contextualize large language models with user history interactions.
Our approach achieves significant efficiency gains by representing user timelines directly as embeddings, leading to substantial inference speedups of up to 78.1X.
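The efficiency argument can be made concrete with a deliberately crude stand-in for the learned encoder: pool a long event timeline into one fixed-size vector, so the context handed to the LLM no longer grows with history length. `embed()` is a placeholder.

```python
# Illustrative compression of a user timeline into one fixed-size
# embedding. Mean pooling is a deliberately simple stand-in for
# User-LLM's learned encoder; embed() is a placeholder.
from typing import Callable

def user_embedding(events: list[str],
                   embed: Callable[[str], list[float]]) -> list[float]:
    vecs = [embed(e) for e in events]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# However long the timeline, the LLM sees a constant-size representation
# rather than thousands of history tokens -- the source of the reported
# inference speedups.
```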
arXiv Detail & Related papers (2024-02-21T08:03:27Z)
- On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
- A Cooperative Memory Network for Personalized Task-oriented Dialogue Systems with Incomplete User Profiles [55.951126447217526]
We study personalized Task-oriented Dialogue Systems without assuming that user profiles are complete.
We propose a Cooperative Memory Network (CoMemNN) that has a novel mechanism to gradually enrich user profiles.
CoMemNN enriches user profiles effectively, which yields a 3.06% improvement in response selection accuracy.
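The gradual-enrichment idea can be caricatured as below; the rule-based extractors are a trivial stand-in for CoMemNN's learned memory mechanism.

```python
# Toy sketch of gradual profile enrichment: missing fields are filled in
# from signals in the ongoing dialogue. Rule-based extractors stand in
# for CoMemNN's learned memory updates.
from typing import Callable, Optional

def enrich_profile(
    profile: dict,
    dialogue_turn: str,
    extractors: dict,  # field name -> Callable[[str], Optional[str]]
) -> dict:
    for field, extract in extractors.items():
        if profile.get(field) is None:
            value = extract(dialogue_turn)
            if value is not None:
                profile[field] = value  # enrich only what is still missing
    return profile

# Example: infer a dietary preference the stored profile lacks.
extractors = {"diet": lambda t: "vegetarian" if "vegetarian" in t else None}
profile = {"name": "alex", "diet": None}
enrich_profile(profile, "Any good vegetarian places nearby?", extractors)
print(profile["diet"])  # -> vegetarian
```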
arXiv Detail & Related papers (2021-02-16T18:05:54Z)