Challenging the Validity of Personality Tests for Large Language Models
- URL: http://arxiv.org/abs/2311.05297v2
- Date: Wed, 5 Jun 2024 10:33:18 GMT
- Title: Challenging the Validity of Personality Tests for Large Language Models
- Authors: Tom Sühr, Florian E. Dorner, Samira Samadi, Augustin Kelava
- Abstract summary: Large language models (LLMs) behave increasingly human-like in text-based interactions.
LLMs' responses to personality tests systematically deviate from human responses.
- Score: 2.9123921488295768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With large language models (LLMs) like GPT-4 appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate personality traits of LLMs using questionnaires originally developed for humans. While reusing measures is a resource-efficient way to evaluate LLMs, careful adaptations are usually required to ensure that assessment results are valid even across human subpopulations. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from human responses, implying that the results of these tests cannot be interpreted in the same way. Concretely, reverse-coded items ("I am introverted" vs. "I am extraverted") are often both answered affirmatively. Furthermore, variation across prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe that it is important to investigate tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".
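The reverse-coded-item observation in the abstract can be illustrated with a small consistency check. The following is a minimal sketch, not the authors' method: it assumes a 5-point Likert scale, and the item wordings, the example responses, and the helper names `reverse_score` and `both_affirmed` are hypothetical. It flags item pairs where a statement and its reverse-coded counterpart are both rated above the scale midpoint.

```python
from typing import Dict, List, Tuple

# Assumed 5-point Likert scale (1 = strongly disagree, 5 = strongly agree).
LIKERT_MIN, LIKERT_MAX = 1, 5
MIDPOINT = (LIKERT_MIN + LIKERT_MAX) / 2


def reverse_score(response: int) -> int:
    """Map a response on a reverse-coded item onto the direction of the trait."""
    return LIKERT_MIN + LIKERT_MAX - response


def both_affirmed(
    responses: Dict[str, int], pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return the (item, reverse-coded item) pairs that are both rated above the midpoint."""
    return [
        (item, reverse_item)
        for item, reverse_item in pairs
        if responses[item] > MIDPOINT and responses[reverse_item] > MIDPOINT
    ]


# Hypothetical responses, invented for illustration only.
responses = {
    "I am extraverted": 4,
    "I am introverted": 5,
    "I am organized": 4,
    "I am disorganized": 2,
}
pairs = [
    ("I am extraverted", "I am introverted"),
    ("I am organized", "I am disorganized"),
]

flagged = both_affirmed(responses, pairs)
print(flagged)
```

A respondent who answers consistently should rarely agree with both a statement and its opposite, so a non-empty `flagged` list for many pairs would suggest that the summed trait score cannot be read the way it is for human respondents.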
Related papers
- Humanity in AI: Detecting the Personality of Large Language Models [0.0]
Questionnaires are a common method for detecting the personality of Large Language Models (LLMs).
We propose combining text mining with the questionnaire method.
We find that the personalities of LLMs are derived from their pre-training data.
arXiv Detail & Related papers (2024-10-11T05:53:11Z)
- Cognitive phantoms in LLMs through the lens of latent variables [0.3441021278275805]
Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour.
Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing behaviour.
This approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans.
This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires.
arXiv Detail & Related papers (2024-09-06T12:42:35Z)
- Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis [0.27309692684728604]
We prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs.
We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties.
arXiv Detail & Related papers (2024-05-12T10:52:15Z)
- "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust [51.542856739181474]
We show how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance.
We find that first-person expressions decrease participants' confidence in the system and tendency to agree with the system's answers, while increasing participants' accuracy.
Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters.
arXiv Detail & Related papers (2024-05-01T16:43:55Z)
- LLMvsSmall Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model [58.887561071010985]
Personality detection aims to detect the personality traits underlying a person's social media posts.
Most existing methods learn post features directly by fine-tuning the pre-trained language models.
We propose a large language model (LLM) based text augmentation enhanced personality detection model.
arXiv Detail & Related papers (2024-03-12T12:10:18Z)
- Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty [53.336235704123915]
We investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties.
We find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses.
We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations.
Lastly, we investigate the preference-annotated datasets used in post training alignment and find that humans are biased against texts with uncertainty.
arXiv Detail & Related papers (2024-01-12T18:03:30Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Self-Assessment Tests are Unreliable Measures of LLM Personality [2.887477629420772]
We analyze the reliability of personality scores obtained from self-assessment personality tests using two simple experiments.
We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios.
Since most self-assessment tests take the form of multiple-choice questions (MCQs), we argue that the scores should also be robust to the order in which the options are presented.
arXiv Detail & Related papers (2023-09-15T05:19:39Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.