PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems
- URL: http://arxiv.org/abs/2602.00016v1
- Date: Mon, 12 Jan 2026 18:15:50 GMT
- Title: PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems
- Authors: Jiongchi Yu, Yuhan Ma, Xiaoyu Zhang, Junjie Wang, Qiang Hu, Chao Shen, Xiaofei Xie
- Abstract summary: We introduce PTCBENCH, a benchmark designed to quantify the consistency of large language model (LLM) personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses personality using the NEO Five-Factor Inventory. Our study of 39,240 personality trait records reveals that certain external scenarios can trigger significant personality changes in LLMs and even alter their reasoning capabilities.
- Score: 30.449659477704543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses personality using the NEO Five-Factor Inventory. Our study of 39,240 personality trait records reveals that certain external scenarios (e.g., "Unemployment") can trigger significant personality changes in LLMs and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.
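The protocol the abstract describes can be pictured as a simple loop: administer the inventory to a model under each situational context, then compare per-condition trait scores against a neutral baseline. The sketch below is illustrative only; `ask_model`, the condition names, and the Likert scoring stub are placeholders, not the paper's actual harness or condition set.

```python
# Minimal sketch of a context-conditioned personality evaluation loop,
# assuming a NEO-FFI-style inventory answered on a 1-5 Likert scale.
# All names and the scoring stub are hypothetical.
from statistics import mean

CONDITIONS = ["neutral", "unemployment", "relocation"]  # the paper uses 12
NEO_FFI_ITEMS = 60  # item count of the NEO Five-Factor Inventory

def ask_model(model: str, condition: str, item_id: int) -> int:
    """Stub for one LLM call answering one inventory item (1-5 Likert).

    A real harness would prompt the model with the situational context
    plus the item text; this deterministic placeholder keeps the sketch
    runnable offline.
    """
    return 1 + (len(model) + 3 * len(condition) + item_id) % 5

def trait_profile(model: str, condition: str) -> float:
    """Mean Likert response across all inventory items under one condition."""
    return mean(ask_model(model, condition, i) for i in range(NEO_FFI_ITEMS))

def context_shift(model: str) -> dict[str, float]:
    """Deviation of each situational context from the neutral baseline."""
    baseline = trait_profile(model, "neutral")
    return {c: round(trait_profile(model, c) - baseline, 3)
            for c in CONDITIONS[1:]}

print(context_shift("example-model"))
```

A full implementation would additionally aggregate per-facet scores and repeat each condition across sampled responses, which is presumably how a record count like 39,240 (models × conditions × items × repetitions) arises.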
Related papers
- LiveCultureBench: a Multi-Agent, Multi-Cultural Benchmark for Large Language Models in Dynamic Social Simulations
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce LiveCultureBench, a multi-cultural, dynamic benchmark that embeds LLMs as agents in a simulated town and evaluates them on both task completion and adherence to socio-cultural norms.
arXiv Detail & Related papers (2026-03-02T15:04:16Z)
- Structured Personality Control and Adaptation for LLM Agents
Large Language Models (LLMs) are increasingly shaping human-computer interaction (HCI). We present a framework that models LLM personality via Jungian psychological types. This design allows the agent to maintain nuanced traits while dynamically adjusting to interaction demands.
arXiv Detail & Related papers (2026-01-15T03:15:24Z)
- DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
The characterization of deception across real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z)
- IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization
We propose IROTE, a novel in-context method for stable and transferable trait elicitation. We show that a single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks.
arXiv Detail & Related papers (2025-08-12T08:04:28Z)
- Persistent Instability in LLMs' Personality Measurements: Effects of Scale, Reasoning, and Conversation History
Even 400B+ models exhibit substantial response variability. Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
arXiv Detail & Related papers (2025-08-06T19:11:33Z)
- A Comparative Study of Large Language Models and Human Personality Traits
Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality.
arXiv Detail & Related papers (2025-05-01T15:10:15Z)
- Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models
This paper proposes a novel multi-observer framework for personality trait assessment in LLM agents. Instead of relying on self-assessments, we employ multiple observer agents. We show that these observer-report ratings align more closely with human judgments than traditional self-assessments.
arXiv Detail & Related papers (2025-04-11T10:03:55Z)
- Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models
This paper introduces a personality-aware user simulation for Conversational Recommender Systems (CRSs). The user agent exhibits customizable personality traits and preferences, while the system agent possesses persuasion capabilities to simulate realistic interactions in CRSs. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits.
arXiv Detail & Related papers (2025-04-09T13:21:17Z)
- Personality Editing for Language Models through Adjusting Self-Referential Queries
We present PALETTE (Personality Adjustment by LLM SElf-TargeTed quEries), a novel method for personality editing in Large Language Models (LLMs). Our approach introduces adjustment queries, where self-referential statements grounded in psychological constructs are treated analogously to factual knowledge, enabling direct editing of personality-related responses. Unlike fine-tuning, PALETTE requires only 12 editing samples to achieve substantial improvements in alignment across personality dimensions.
arXiv Detail & Related papers (2025-02-17T13:28:14Z)
- Exploring the Personality Traits of LLMs through Latent Features Steering
We investigate how factors such as cultural norms and environmental stressors, encoded within large language models (LLMs), shape their personality traits. We propose a training-free approach to modifying the model's behavior by extracting and steering latent features corresponding to these factors within the model.
arXiv Detail & Related papers (2024-10-07T21:02:34Z)
- PersLLM: A Personified Training Approach for Large Language Models
We propose PersLLM, a framework for better data construction and model tuning. To address insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. To address rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
- Evaluating Large Language Models with Psychometrics
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- Personality Traits in Large Language Models
Personality is a key factor determining the effectiveness of communication. We present a novel, comprehensive, psychometrically valid and reliable methodology for administering and validating personality tests on widely used large language models. We discuss the application and ethical implications of the measurement and shaping method, in particular regarding responsible AI.
arXiv Detail & Related papers (2023-07-01T00:58:51Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.