Related papers: Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

URL: http://arxiv.org/abs/2508.04826v1
Date: Wed, 06 Aug 2025 19:11:33 GMT
Title: Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History
Authors: Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, Mahmood Hegazy, Alberto Tosato, David John Lemay, Irina Rish, Guillaume Dumas,
Abstract summary: Even 400B+ models exhibit substantial response variability.<n> Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability.<n>For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
Score: 7.58175460763641
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.

Related papers

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs [0.18416014644193066]
Large language models (LLMs) make it possible to generate synthetic behavioural data at scale.<n>We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation.
arXiv Detail & Related papers (2025-06-30T08:16:07Z)
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation [16.76995815742803]
We propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity.<n>Our three key metrics measure the degree of persona alignment and consistency within and across generations.<n>By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability.
arXiv Detail & Related papers (2025-06-24T06:33:10Z)
A Comparative Study of Large Language Models and Human Personality Traits [6.354326674890978]
Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation.<n>This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality.
arXiv Detail & Related papers (2025-05-01T15:10:15Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Rediscovering the Latent Dimensions of Personality with Large Language Models as Trait Descriptors [4.814107439144414]
We introduce a novel approach that uncovers latent personality dimensions in large language models (LLMs) Our experiments show that LLMs "rediscover" core personality traits such as extraversion, agreeableness, conscientiousness, neuroticism, and openness without relying on direct questionnaire inputs. We can use the derived principal components to assess personality along the Big Five dimensions, and achieve improvements in average personality prediction accuracy of up to 5% over fine-tuned models.
arXiv Detail & Related papers (2024-09-16T00:24:40Z)
Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks.<n>In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.<n>We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.<n>These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
LLMs Simulate Big Five Personality Traits: Further Evidence [51.13560635563004]
We analyze the personality traits simulated by Llama2, GPT4, and Mixtral. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits.
arXiv Detail & Related papers (2024-01-31T13:45:25Z)
Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims that is not observable. Existing susceptibility studies heavily rely on self-reported beliefs. We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
Personality Traits in Large Language Models [42.31355340867784]
Personality is a key factor determining the effectiveness of communication.<n>We present a novel and comprehensive psychometrically valid and reliable methodology for administering and validating personality tests on widely-used large language models.<n>We discuss the application and ethical implications of the measurement and shaping method, in particular regarding responsible AI.
arXiv Detail & Related papers (2023-07-01T00:58:51Z)
Personality testing of Large Language Models: Limited temporal stability, but highlighted prosociality [0.0]
Large Language Models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users. This study aimed to assess the temporal stability and inter-rater agreement on their responses on personality instruments in two time points. The findings revealed varying levels of inter-rater agreement in the LLMs responses over a short time.
arXiv Detail & Related papers (2023-06-07T10:14:17Z)
Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
Key challenge for robotic systems is to figure out the behavior of another agent. Processing correct inferences is especially challenging when (confounding) factors are not controlled experimentally. We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.