Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis
- URL: http://arxiv.org/abs/2405.07248v1
- Date: Sun, 12 May 2024 10:52:15 GMT
- Title: Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis
- Authors: Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow,
- Abstract summary: We prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs.
We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties.
- Score: 0.27309692684728604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
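To make the measurement idea concrete, the workflow behind this kind of study can be sketched in two steps: (1) prompt a model under a persona to rate standardized questionnaire items on a Likert scale, and (2) check whether the collected ratings show acceptable psychometric properties such as internal consistency. The Python sketch below is illustrative only, not the authors' code: the persona texts and items are hypothetical placeholders, and Cronbach's alpha is used as one standard example of a reliability statistic.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability for a (respondents x items) matrix of Likert ratings."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item across respondents
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the per-respondent scale totals
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

def build_prompt(persona: str, item: str) -> str:
    """Wrap one questionnaire item in a persona instruction (illustrative wording only)."""
    return (
        f"{persona}\n"
        "Answering as this person, rate the statement below from 1 (disagree strongly) "
        "to 5 (agree strongly). Reply with the number only.\n"
        f"Statement: {item}"
    )

# Hypothetical generic personas and extraversion-style items (not the study's materials).
personas = [
    "You are a 34-year-old teacher who enjoys quiet evenings and long-term planning.",
    "You are a 22-year-old student who loves parties and spontaneous road trips.",
    "You are a 58-year-old nurse who values routine and close friendships.",
]
items = [
    "I see myself as someone who is talkative.",
    "I see myself as someone who is outgoing, sociable.",
    "I see myself as someone who is full of energy.",
]

# Example prompt that would be sent to the model for one persona/item pair.
print(build_prompt(personas[0], items[0]))

# Stand-in for the ratings an LLM would return: one row per persona, one column per item.
ratings = np.array([
    [2, 3, 2],   # persona 1
    [5, 4, 5],   # persona 2
    [3, 3, 4],   # persona 3
])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```

Under this framing, "poor psychometric properties" would show up as, for example, low reliability coefficients or item correlations that diverge from human norms, which is the pattern the paper reports for the specific demographic personas.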
Related papers
- Humanity in AI: Detecting the Personality of Large Language Models [0.0]
Questionnaires are a common method for detecting the personality of Large Language Models (LLMs).
We propose combining text mining with the questionnaire method.
We find that the personalities of LLMs are derived from their pre-trained data.
arXiv Detail & Related papers (2024-10-11T05:53:11Z)
- Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychological dimensions in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas [14.650234624251716]
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks.
These tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences.
We examine the role of prompting LLMs with human-like personas and ask the models to answer as if they were a specific human.
arXiv Detail & Related papers (2024-06-20T16:24:07Z)
- Large Language Models Show Human-like Social Desirability Biases in Survey Responses [12.767606361552684]
We show that Large Language Models (LLMs) skew their scores towards the desirable ends of trait dimensions when they infer that their personality is being evaluated.
This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2.
Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias (see the reverse-coding sketch after this list).
arXiv Detail & Related papers (2024-05-09T19:02:53Z)
- Challenging the Validity of Personality Tests for Large Language Models [2.9123921488295768]
Large language models (LLMs) behave in an increasingly human-like manner in text-based interactions.
LLMs' responses to personality tests systematically deviate from human responses.
arXiv Detail & Related papers (2023-11-09T11:54:01Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
- Editing Personality for Large Language Models [73.59001811199823]
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs).
We construct PersonalityEdit, a new benchmark dataset to address this task.
arXiv Detail & Related papers (2023-10-03T16:02:36Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
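Picking up the reverse-coding point from the social desirability entry above: reverse-coding re-expresses negatively keyed Likert items on the same scale, so that a tendency to agree with everything (acquiescence) can be separated from a genuine pull towards the socially desirable pole. A minimal sketch, assuming a hypothetical 1-5 scale and item keying:

```python
# Minimal reverse-coding sketch for 1-5 Likert items; the items and keying are hypothetical.
SCALE_MIN, SCALE_MAX = 1, 5

def reverse_code(score: int) -> int:
    """Map 1<->5, 2<->4, 3<->3 so negatively keyed items point in the trait's direction."""
    return SCALE_MAX + SCALE_MIN - score

raw_responses = {"talkative": 5, "reserved": 4, "outgoing": 5}  # 'reserved' is negatively keyed
negatively_keyed = {"reserved"}

scored = {
    item: reverse_code(value) if item in negatively_keyed else value
    for item, value in raw_responses.items()
}
print(scored)  # {'talkative': 5, 'reserved': 2, 'outgoing': 5}
```

If a model were merely agreeing with every statement, reverse-coded items would pull the scale mean back towards the midpoint; scores that stay skewed towards the desirable pole even after reverse-coding point to social desirability instead, which is the pattern that paper reports.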