Can ChatGPT Assess Human Personalities? A General Evaluation Framework
- URL: http://arxiv.org/abs/2303.01248v3
- Date: Fri, 13 Oct 2023 15:53:00 GMT
- Title: Can ChatGPT Assess Human Personalities? A General Evaluation Framework
- Authors: Haocong Rao, Cyril Leung, Chunyan Miao
- Abstract summary: Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
- Score: 70.90142717649785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), especially ChatGPT, have produced impressive results in various areas, but their potential human-like psychology remains largely unexplored. Existing works study the virtual personalities of LLMs but rarely explore the possibility of analyzing human personalities via LLMs. This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests. Specifically, we first devise unbiased prompts by randomly permuting the options in MBTI questions and averaging the test results to encourage more impartial answer generation. Then, we replace the subject of the question statements to enable flexible queries and assessments of different subjects through LLMs. Finally, we re-formulate the question instructions as correctness evaluations to help LLMs generate clearer responses. The proposed framework enables LLMs to flexibly assess the personalities of different groups of people. We further propose three evaluation metrics to measure the consistency, robustness, and fairness of assessment results from state-of-the-art LLMs, including ChatGPT and GPT-4. Our experiments reveal ChatGPT's ability to assess human personalities, and the averaged results show that it achieves more consistent and fairer assessments than InstructGPT, despite lower robustness against prompt biases.
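The three prompt-construction steps in the abstract (randomly permuting options and averaging the results, substituting the subject of each statement, and phrasing instructions as correctness evaluations) can be illustrated with a minimal sketch. Everything below (the example statement, the option labels, the helper names, and the majority-vote aggregation) is an illustrative assumption, not the authors' released code.

```python
import random
from collections import Counter

def build_unbiased_prompt(statement, options, subject="people"):
    """Build one query: substitute the subject, shuffle the options, and
    phrase the instruction as a correctness evaluation (illustrative only)."""
    statement = statement.replace("<subject>", subject)   # flexible subject substitution
    shuffled = random.sample(options, k=len(options))     # random option permutation
    option_text = "\n".join(f"({i + 1}) {o}" for i, o in enumerate(shuffled))
    instruction = (
        "Evaluate which option is the most correct description of "
        f"{subject}, and answer only with the option text.\n"
        f"Statement: {statement}\nOptions:\n{option_text}"
    )
    return instruction, shuffled

def averaged_answer(llm, statement, options, subject, n_runs=5):
    """Query the model n_runs times with independently permuted options and
    take the majority answer, approximating the averaged testing result."""
    votes = Counter()
    for _ in range(n_runs):
        prompt, shuffled = build_unbiased_prompt(statement, options, subject)
        reply = llm(prompt)                                # llm: prompt -> response text
        for option in shuffled:
            if option.lower() in reply.lower():
                votes[option] += 1
                break
    return votes.most_common(1)[0][0] if votes else None

# Example usage with a stand-in model (replace with a real API call):
if __name__ == "__main__":
    options = ["agree", "neutral", "disagree"]
    fake_llm = lambda prompt: "agree"                      # placeholder model
    print(averaged_answer(fake_llm, "<subject> enjoy large social gatherings.",
                          options, subject="software engineers"))
```

Majority voting over permuted prompts stands in here for the paper's averaged testing result; the paper's exact aggregation of answers into MBTI dimensions may differ.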
Related papers
- Humanity in AI: Detecting the Personality of Large Language Models [0.0]
Questionnaires are a common method for detecting the personality of Large Language Models (LLMs).
We propose combining text mining with the questionnaire method.
We find that the personalities of LLMs are derived from their pre-training data.
arXiv Detail & Related papers (2024-10-11T05:53:11Z)
- Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis [0.27309692684728604]
We prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs.
We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties.
arXiv Detail & Related papers (2024-05-12T10:52:15Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (a minimal Elo-update sketch appears after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- Challenging the Validity of Personality Tests for Large Language Models [2.9123921488295768]
Large language models (LLMs) behave increasingly human-like in text-based interactions.
LLMs' responses to personality tests systematically deviate from human responses.
arXiv Detail & Related papers (2023-11-09T11:54:01Z)
- PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits [30.770525830385637]
We study the behavior of large language models (LLMs) based on the Big Five personality model.
Results show that LLM personas' self-reported BFI scores are consistent with their designated personality types.
Human evaluation shows that humans can perceive some personality traits with an accuracy of up to 80%.
arXiv Detail & Related papers (2023-05-04T04:58:00Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models [78.08792285698853]
We present a large-scale empirical study on general language ability evaluation of pretrained language models (ElitePLM).
Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; and (3) PLMs have excellent transferability between similar tasks.
arXiv Detail & Related papers (2022-05-03T14:18:10Z)
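As a reference for the Maximum Discrepancy (MAD) competition entry above, which aggregates pairwise LLM comparisons into a global ranking with the Elo rating system, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the initial rating of 1500 are conventional defaults, not values reported in that paper.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update for one pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins one pairwise comparison.
print(elo_update(1500.0, 1500.0, score_a=1.0))  # A gains the points that B loses
```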