Related papers: CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

URL: http://arxiv.org/abs/2508.20385v1
Date: Thu, 28 Aug 2025 03:17:47 GMT
Title: CAPE: Context-Aware Personality Evaluation Framework for Large Language Models
Authors: Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki,
Abstract summary: We propose the first Context-Aware Personality Evaluation framework for Large Language Models (LLMs)<n>Our experiments reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts.<n>Our framework can be applied to Role Playing Agents (RPAs) to better align with human judgments.
Score: 8.618075786777219
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior. Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models response stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama--8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE

Related papers

Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents [0.48439699124726004]
Large language models (LLMs) have been shown to reproduce well-known biases.<n>We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment.<n>We found notable differences between models in how they aligned human behavior.
arXiv Detail & Related papers (2026-02-05T12:33:05Z)
Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions [11.415343473837583]
Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation.<n>We introduce an evaluation protocol that combines long persona dialogues and evaluation datasets to create dialogue-conditioned benchmarks.<n>We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations.
arXiv Detail & Related papers (2025-12-14T17:27:02Z)
Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts [4.618735978506653]
ROME is a novel framework that explicitly injects psychological knowledge into personality detection.<n>We show that ROME consistently outperforms state-of-the-art baselines in experiments on two real-world datasets.
arXiv Detail & Related papers (2025-12-09T17:07:54Z)
Evaluating LLM Alignment on Personality Inference from Real-World Interview Data [7.061237517845673]
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding.<n>Their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored.<n>We introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores.
arXiv Detail & Related papers (2025-09-16T16:54:35Z)
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations [23.11314388159895]
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries.<n>Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue.
arXiv Detail & Related papers (2025-07-30T17:56:23Z)
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction [123.89581506075461]
We propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency.<n> Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction.<n>Our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms.
arXiv Detail & Related papers (2025-05-26T17:55:06Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs)<n>Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.<n>We compare EI attributes and persona consistency to understand the challenges posed by real-world dialogues.<n>Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates.<n>We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information.<n>Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents [56.64615470513102]
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations.<n>Traditional setting limits each participant to one message at a time and requires constant human participation.<n>This paper proposes textbftextscX-Turing, which enhances the original test with a textitburst dialogue pattern.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction. We find that contextual characteristics significantly affect human reliance behavior. Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
Large Language Models Can Infer Personality from Free-Form User Interactions [0.0]
GPT-4 can infer personality with moderate accuracy, outperforming previous approaches. Results show that the direct focus on personality assessment did not result in a less positive user experience. Preliminary analyses suggest that the accuracy of personality inferences varies only marginally across different socio-demographic subgroups.
arXiv Detail & Related papers (2024-05-19T20:33:36Z)
PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.