Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis
- URL: http://arxiv.org/abs/2506.23055v1
- Date: Sun, 29 Jun 2025 01:56:56 GMT
- Title: Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis
- Authors: Hiro Taiyo Hamada, Ippei Fujisawa, Genji Kawakita, Yuki Yamada
- Abstract summary: We develop a framework to assess concept alignment between Large Language Models and human psychological dimensions. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%). Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) such as ChatGPT have shown remarkable abilities in producing human-like text. However, it is unclear how accurately these models internalize concepts that shape human thought and behavior. Here, we developed a quantitative framework to assess concept alignment between LLMs and human psychological dimensions using 43 standardized psychological questionnaires, selected for their established validity in measuring distinct psychological constructs. Our method evaluates how accurately language models reconstruct and classify questionnaire items through pairwise similarity analysis. We compared resulting cluster structures with the original categorical labels using hierarchical clustering. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%), all exceeding random baseline performance (31.9%). We also demonstrated that the semantic similarity estimated by GPT-4 is associated with Pearson's correlation coefficients of human responses across multiple psychological questionnaires. This framework provides a novel approach to evaluating human-LLM concept alignment and identifying potential representational biases. Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy, offering insights for developing more interpretable AI systems.
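The evaluation pipeline the abstract describes (embed questionnaire items, compute pairwise similarities, cluster hierarchically, and score the clusters against the original construct labels) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the embeddings are random stand-ins for LLM-derived representations, and the item counts are hypothetical.

```python
# Illustrative sketch of the paper's framework: embed questionnaire items,
# compute pairwise (cosine) similarities, apply hierarchical clustering, and
# measure how well clusters recover the original construct labels.
# The embeddings below are random stand-ins, NOT actual LLM representations.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_constructs, items_per, dim = 3, 10, 8
labels = np.repeat(np.arange(n_constructs), items_per)  # true construct per item

# Stand-in embeddings: items of one construct scatter around a shared center.
centers = rng.normal(size=(n_constructs, dim))
emb = centers[labels] + 0.1 * rng.normal(size=(len(labels), dim))

# Pairwise cosine distances, then average-linkage hierarchical clustering,
# cut so the number of clusters matches the number of constructs.
tree = linkage(pdist(emb, metric="cosine"), method="average")
clusters = fcluster(tree, t=n_constructs, criterion="maxclust")

# Classification accuracy: each cluster predicts its majority construct label.
pred = np.empty_like(labels)
for c in np.unique(clusters):
    mask = clusters == c
    pred[mask] = np.bincount(labels[mask]).argmax()
acc = float((pred == labels).mean())
print(f"clustering accuracy: {acc:.3f}")
```

With well-separated synthetic clusters the accuracy is close to 1.0; the paper reports the analogous figure of 66.2% for GPT-4 over 43 questionnaires, against a 31.9% random baseline.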
Related papers
- Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement [16.608577295968942]
The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. Psychometrics is the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes an emerging interdisciplinary field of LLM Psychometrics.
arXiv Detail & Related papers (2025-05-13T05:47:51Z)
- Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests [0.0]
We explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. The results reveal distinct strengths and limitations of human and AI approaches.
arXiv Detail & Related papers (2025-03-15T10:54:35Z)
- Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation. We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Assessment and manipulation of latent constructs in pre-trained language models using psychometric scales [4.805861461250903]
We show how standard psychological questionnaires can be reformulated into natural language inference prompts. We demonstrate, using a sample of 88 publicly available models, the existence of human-like mental health-related constructs.
arXiv Detail & Related papers (2024-09-29T11:00:41Z)
- Idiographic Personality Gaussian Process for Psychological Assessment [7.394943089551214]
We develop a novel measurement framework based on a Gaussian process coregionalization model to address a long-lasting debate in psychometrics.
We propose the idiographic personality Gaussian process (IPGP) framework, an intermediate model that accommodates both shared trait structure across a population and "idiographic" deviations for individuals.
arXiv Detail & Related papers (2024-07-06T06:09:04Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
- Investigating Large Language Models' Perception of Emotion Using Appraisal Theory [3.0902630634005797]
Large Language Models (LLMs) have advanced significantly in recent years and are now being used by the general public.
In this work, we investigate their emotion perception through the lens of appraisal and coping theory.
We applied SCPQ to three recent OpenAI LLMs (davinci-003, ChatGPT, and GPT-4) and compared the results with predictions from appraisal theory and with human data.
arXiv Detail & Related papers (2023-10-03T16:34:47Z)
- Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.