PersonaGym: Evaluating Persona Agents and LLMs
- URL: http://arxiv.org/abs/2407.18416v5
- Date: Fri, 05 Sep 2025 16:33:46 GMT
- Title: PersonaGym: Evaluating Persona Agents and LLMs
- Authors: Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari
- Abstract summary: We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Persona agents, which are LLM agents conditioned to act according to an assigned persona, enable contextually rich and user-aligned interactions across domains like education and healthcare. However, evaluating how faithfully these agents adhere to their personas remains a significant challenge, particularly in free-form settings that demand consistency across diverse, persona-relevant environments. We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory that enables comprehensive large-scale evaluation. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities. For example, GPT-4.1 had the exact same PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed-source model. Importantly, increased model size and complexity do not necessarily enhance persona agent capabilities, underscoring the need for algorithmic and architectural innovation toward faithful, performant persona agents.
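The abstract describes PersonaScore as an average over per-question adherence judgments. The sketch below is an illustrative, hypothetical aggregation only: `judge_response` is a stand-in for an LLM-based evaluator (which PersonaGym would use), and the keyword heuristic inside it is invented purely so the example runs; it is not the paper's actual scoring method.

```python
# Illustrative sketch of a PersonaScore-style aggregation, NOT the
# actual PersonaGym implementation. `judge_response` is a hypothetical
# stand-in for an evaluator LLM scoring persona adherence on 1-5.
from statistics import mean


def judge_response(persona: str, question: str, response: str) -> int:
    """Hypothetical judge: returns a 1-5 persona-adherence score.
    A real system would prompt an evaluator LLM here; this stub
    rewards responses that mention the persona, for illustration."""
    return 5 if persona.split()[0].lower() in response.lower() else 2


def persona_score(persona: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Average per-question adherence judgments into one persona-level score."""
    return mean(judge_response(persona, q, r) for q, r in qa_pairs)


if __name__ == "__main__":
    persona = "marathon runner from Nairobi"
    qa = [
        ("What do you do on weekends?",
         "As a marathon runner, I log long training runs."),
        ("Favourite food?", "I enjoy pizza."),
    ]
    print(persona_score(persona, qa))  # mean of the two judged scores
```

At scale, the same aggregation would run over many personas and environment-specific questions (the paper reports 200 personas and 10,000 questions), with the stub replaced by a model-based judge.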
Related papers
- HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z) - Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning [52.07170679746533]
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics that capture different types of persona drift: prompt-to-line consistency, line-to-line consistency, and Q&A consistency, and validate each against human annotations.
arXiv Detail & Related papers (2025-10-31T19:40:41Z) - TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation [55.55404595177229]
Large Language Models (LLMs) are exhibiting emergent human-like abilities. TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
arXiv Detail & Related papers (2025-10-29T14:00:42Z) - DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans [28.038221167188013]
Large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors. The Dynamic Persona Refinement Framework (DPRF) aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals.
arXiv Detail & Related papers (2025-10-16T01:26:38Z) - PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions [14.497181581363288]
PersonaFuse is a novel framework that enables Large Language Models to adapt and express different personalities. Tests show PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. PersonaFuse also delivers consistent improvements in downstream human-centered applications.
arXiv Detail & Related papers (2025-09-09T03:39:28Z) - PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization [25.45861816665351]
We introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization. PersonaFeedback consists of 8,298 human-annotated test cases, which are categorized into easy, medium, and hard tiers.
arXiv Detail & Related papers (2025-06-15T17:19:19Z) - PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [87.99027488664282]
PersonaAgent is a framework designed to address versatile personalization tasks. It integrates a personalized memory module and a personalized action module. A test-time alignment strategy keeps agent behavior aligned with user preferences in real time.
arXiv Detail & Related papers (2025-06-06T17:29:49Z) - The Traitors: Deception and Trust in Multi-Agent Language Model Simulations [0.0]
We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games. We develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry.
arXiv Detail & Related papers (2025-05-19T10:01:35Z) - PAARS: Persona Aligned Agentic Retail Shoppers [2.8737584376365355]
In e-commerce, behavioral data is collected for decision making, which can be costly and slow.
We propose a framework that creates synthetic shopping agents by automatically mining anonymised historical shopping data.
We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results.
arXiv Detail & Related papers (2025-03-31T15:41:51Z) - Designing LLM-Agents with Personalities: A Psychometric Approach [0.47498241053872914]
This research introduces a novel methodology for assigning quantifiable, controllable and psychometrically validated personalities to Agents.
It seeks to overcome the constraints of human subject studies, proposing Agents as an accessible tool for social science inquiry.
arXiv Detail & Related papers (2024-10-25T01:05:04Z) - IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
We introduce an automatic evaluation framework IQA-EVAL to achieve Interactive Question Answering Evaluations.
We also introduce a LLM-based Evaluation Agent (LEA) that can simulate human behaviors to generate interactions with IQA models.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
arXiv Detail & Related papers (2024-08-24T10:34:20Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation to build such agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
arXiv Detail & Related papers (2024-06-06T15:15:41Z) - Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents [4.8916211213796394]
This paper begins by examining the rationale and implications of imbuing CAs with unique personas.
We delve into the specific applications where the implementation of a persona is not just beneficial but critical for LLM-based CAs.
The paper underscores the necessity of a nuanced approach to persona integration, highlighting the potential challenges and ethical dilemmas that may arise.
arXiv Detail & Related papers (2024-05-26T11:36:48Z) - From Persona to Personalization: A Survey on Role-Playing Language Agents [52.783043059715546]
Recent advancements in large language models (LLMs) have fueled the rise of Role-Playing Language Agents (RPLAs).
RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance.
They have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots.
arXiv Detail & Related papers (2024-04-28T15:56:41Z) - AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems [112.76941157194544]
We propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering.
We consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together.
Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions.
arXiv Detail & Related papers (2023-10-13T16:37:14Z) - The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI).
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Human Choice Prediction in Language-based Persuasion Games: Simulation-based Off-Policy Evaluation [24.05034588588407]
This paper addresses a key aspect in the design of such agents: predicting human decisions in off-policy evaluation.
We collected a dataset of 87K decisions from humans playing a repeated decision-making game with artificial agents.
Our approach involves training a model on human interactions with one subset of agents to predict decisions when interacting with another.
arXiv Detail & Related papers (2023-05-17T16:38:11Z) - Improving Personality Consistency in Conversation by Persona Extending [22.124187337032946]
We propose a novel retrieval-to-prediction paradigm consisting of two subcomponents, namely, the Persona Retrieval Model (PRM) and the Posterior-scored Transformer (PS-Transformer).
Our proposed model yields considerable improvements in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2022-08-23T09:00:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.