Related papers: PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

URL: http://arxiv.org/abs/2506.09902v1
Date: Wed, 11 Jun 2025 16:16:07 GMT
Title: PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants
Authors: Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz
Abstract summary: We introduce PersonaLens, a benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents. We reveal significant variability in the personalization capabilities of current LLM assistants, providing crucial insights for advancing conversational AI systems.
Score: 31.486658078902025
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.

Related papers

Is Passive Expertise-Based Personalization Enough? A Case Study in AI-Assisted Test-Taking [29.26173340915243]
Expert users have different systematic preferences in task-oriented dialogues. We built a version of an enterprise AI assistant with passive personalization. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.
arXiv Detail & Related papers (2025-11-28T17:21:41Z)
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction [55.24448139349266]
We present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. To improve personalized service-oriented interactions, we propose H$2$Memory, a hierarchical and heterogeneous memory framework.
arXiv Detail & Related papers (2025-11-17T14:22:32Z)
Human vs. Agent in Task-Oriented Conversations [22.743152820695588]
This work presents the first systematic comparison between large language model (LLM)-simulated users and human users in personalized task-oriented conversations. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches.
arXiv Detail & Related papers (2025-09-22T11:30:39Z)
PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization [25.45861816665351]
We introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers.
arXiv Detail & Related papers (2025-06-15T17:19:19Z)
PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [87.99027488664282]
PersonaAgent is a framework designed to address versatile personalization tasks. It integrates a personalized memory module and a personalized action module. A test-time user-preference alignment strategy ensures real-time user preference alignment.
arXiv Detail & Related papers (2025-06-06T17:29:49Z)
A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z)
A Survey of Personalized Large Language Models: Progress and Future Directions [86.45576419251302]
Large Language Models (LLMs) excel in handling general knowledge tasks, yet struggle with user-specific personalization. Personalized Large Language Models (PLLMs) tackle these challenges by leveraging individual user data. PLLMs can significantly enhance user satisfaction and have broad applications in conversational agents, systems, emotion recognition, medical assistants, and more.
arXiv Detail & Related papers (2025-02-17T07:58:31Z)
Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [23.34710429552906]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z)
Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations. Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
User Modeling Challenges in Interactive AI Assistant Systems [3.1204913702660475]
Interactive Artificial Intelligence (AI) assistant systems are designed to offer timely guidance to help human users complete a variety of tasks. One of the remaining challenges is to understand users' mental states during the task for more personalized guidance. In this work, we analyze users' mental states during task executions and investigate the capabilities and challenges for large language models to interpret user profiles for more personalized user guidance.
arXiv Detail & Related papers (2024-03-29T11:54:13Z)
TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation. Specifically, task decomposition, tool selection, and parameter prediction are assessed. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
Decision-Oriented Dialogue for Human-AI Collaboration [62.367222979251444]
We describe a class of tasks called decision-oriented dialogues, in which AI assistants such as large language models (LMs) must collaborate with one or more humans via natural language to help them make complex decisions. We formalize three domains in which users face everyday decisions: (1) choosing an assignment of reviewers to conference papers, (2) planning a multi-step itinerary in a city, and (3) negotiating travel plans for a group of friends. For each task, we build a dialogue environment where agents receive a reward based on the quality of the final decision they reach.
arXiv Detail & Related papers (2023-05-31T17:50:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.