Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects
- URL: http://arxiv.org/abs/2603.03915v1
- Date: Wed, 04 Mar 2026 10:24:02 GMT
- Title: Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects
- Authors: Ji-Lun Peng, Yun-Nung Chen
- Abstract summary: Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). Current research primarily evaluates RPAs using famous fictional characters. This dependency creates a bias that limits the generalization of RPAs to unseen personas. This work establishes a fairer evaluation protocol and validates a scalable, personality-enhanced framework for constructing robust RPAs.
- Score: 21.364811854827163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). However, current research primarily evaluates RPAs using famous fictional characters, allowing models to rely on memory associated with character names. This dependency creates a bias that limits the generalization of RPAs to unseen personas. To address this issue, we propose an anonymous evaluation method. Experiments across multiple benchmarks reveal that anonymization significantly degrades role-playing performance, confirming that name exposure carries implicit information. Furthermore, we investigate personality augmentation to enhance role fidelity under the anonymous setting. We systematically compare the efficacy of personality traits derived from human annotations versus those self-generated by the model. Our results demonstrate that incorporating personality information consistently improves RPA performance. Crucially, self-generated personalities achieve performance comparable to human-annotated ones. This work establishes a fairer evaluation protocol and validates a scalable, personality-enhanced framework for constructing robust RPAs.
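The paper does not publish its anonymization procedure here, but the core idea (stripping a character's name so the model cannot rely on name-triggered memorization) can be sketched as follows. The function name, alias list, and placeholder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of anonymous evaluation: replace a character's
# name and known aliases with a neutral placeholder before building
# the role-play prompt, so the model cannot exploit memorized
# associations with the name. All names below are illustrative.
import re

def anonymize_profile(profile: str, aliases: list[str],
                      placeholder: str = "Character A") -> str:
    """Replace every alias of the character with a neutral placeholder."""
    # Replace longest aliases first so "Sherlock Holmes" is not
    # partially consumed by a shorter alias like "Holmes".
    for alias in sorted(aliases, key=len, reverse=True):
        profile = re.sub(re.escape(alias), placeholder, profile,
                         flags=re.IGNORECASE)
    return profile

profile = "Sherlock Holmes is a detective. Holmes lives at 221B Baker Street."
anon = anonymize_profile(profile, ["Sherlock Holmes", "Sherlock", "Holmes"])
print(anon)
# Character A is a detective. Character A lives at 221B Baker Street.
```

A real protocol would also need to anonymize indirect identifiers (addresses, catchphrases); the sketch only covers direct name mentions.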
Related papers
- Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents [13.029517493304505]
We propose a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM's pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.
arXiv Detail & Related papers (2026-01-08T08:33:40Z)
- Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations [1.1740676332544449]
We propose Amadeus, a training-free framework for role-playing agents (RPAs). The Amadeus framework can significantly enhance persona consistency even when responding to questions that lie beyond a character's knowledge.
arXiv Detail & Related papers (2025-08-04T03:27:05Z)
- Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning [46.47940531288568]
This paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift.
arXiv Detail & Related papers (2025-06-02T14:55:04Z)
- Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data [58.92110996840019]
We propose to enhance role-playing language models (RPLMs) via personality-indicative data.
Specifically, we leverage questions from psychological scales and distill advanced RPAs to generate dialogues that grasp the minds of characters.
Experimental results validate that RPLMs trained with our dataset exhibit advanced role-playing capabilities for both general and personality-related evaluations.
arXiv Detail & Related papers (2024-06-27T06:24:00Z)
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
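The selection loop this blurb describes (label unlabeled data, score each pairing, keep only trusted pairs for retraining) can be sketched minimally. The function, the stub `model` and `scorer`, and the threshold are illustrative assumptions, not the paper's actual components.

```python
# Illustrative self-training selection step: a model produces pseudo-
# labels for unlabeled reviews, a scorer rates how well each pseudo-
# label matches its review, and only high-scoring pairs are kept for
# the next round of training. All components here are toy stand-ins.
def select_pseudo_labels(unlabeled, model, scorer, threshold=0.9):
    """Keep only (review, pseudo_label) pairs the scorer trusts."""
    selected = []
    for review in unlabeled:
        pseudo = model(review)          # model predicts a pseudo-label
        score = scorer(review, pseudo)  # scorer rates the match
        if score >= threshold:
            selected.append((review, pseudo))
    return selected

# Toy usage with stub model and scorer:
model = lambda r: "positive" if "great" in r else "negative"
scorer = lambda r, p: 0.95 if p == "positive" else 0.5
data = ["great phone", "bad battery"]
print(select_pseudo_labels(data, model, scorer))
# [('great phone', 'positive')]
```

In the actual ASQP setting the pseudo-labels are sentiment quads rather than single polarities, but the filter-then-retrain structure is the same.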
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO [55.25989137825992]
We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
arXiv Detail & Related papers (2024-04-22T08:00:51Z)
- LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model [58.887561071010985]
Personality detection aims to detect one's personality traits underlying social media posts.
Most existing methods learn post features directly by fine-tuning the pre-trained language models.
We propose a large language model (LLM) based text augmentation enhanced personality detection model.
arXiv Detail & Related papers (2024-03-12T12:10:18Z)
- PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
- InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews [57.04431594769461]
This paper introduces a novel perspective to evaluate the personality fidelity of RPAs with psychological scales.
Experiments include various types of RPAs and LLMs, covering 32 distinct characters on 14 widely used psychological scales.
With InCharacter, we show that state-of-the-art RPAs exhibit personalities highly aligned with the human-perceived personalities of the characters, achieving an accuracy up to 80.7%.
arXiv Detail & Related papers (2023-10-27T08:42:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.