Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
- URL: http://arxiv.org/abs/2510.11254v1
- Date: Mon, 13 Oct 2025 10:43:49 GMT
- Title: Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
- Authors: Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier
- Abstract summary: Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). In this study, we evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity.
- Score: 7.68863194266262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.
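For illustration, here is a minimal sketch of the kind of correlation check the abstract describes for convergent and ecological validity. It is not the authors' code; the scores, libraries, and variable names are placeholder assumptions, with ecological validity framed as the correlation between psychometric test scores and scores derived from downstream behavior.

```python
# Minimal sketch (not from the paper) of test-score vs. behavior correlation.
# The numbers below are hypothetical per-model scores, e.g. a sexism
# questionnaire score vs. the rate of sexist outputs in a downstream task.
from scipy.stats import pearsonr, spearmanr

test_scores = [0.42, 0.31, 0.58, 0.27, 0.49]       # psychometric test scores
behavior_scores = [0.35, 0.44, 0.22, 0.51, 0.30]   # downstream behavior scores

# Convergent validity would correlate two tests targeting the same construct;
# ecological validity correlates test scores with downstream behavior.
r, p = pearsonr(test_scores, behavior_scores)
rho, p_rank = spearmanr(test_scores, behavior_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f} (p = {p_rank:.3f})")
```

A near-zero or negative correlation in such a check is what the abstract refers to as low ecological validity.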
Related papers
- Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead [2.809966405091883]
We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems.
arXiv Detail & Related papers (2025-07-30T18:14:35Z) - From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology [0.0]
We argue that building a robust science of AI psychology requires integrating the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition.
arXiv Detail & Related papers (2025-06-20T02:38:42Z) - Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? [24.31027563947265]
Knowing how test takers answer items in educational assessments is essential for test development. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development.
arXiv Detail & Related papers (2025-06-11T14:41:10Z) - TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z) - Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests [87.0481906768826]
We examine three evaluation paradigms: standard benchmarks, interactive games, and cognitive tests. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating between models. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by the assessment of human abilities.
arXiv Detail & Related papers (2025-02-20T08:36:58Z) - Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z) - Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - Self-Assessment Tests are Unreliable Measures of LLM Personality [2.887477629420772]
We analyze the reliability of personality scores obtained from self-assessment personality tests using two simple experiments.
We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios.
Since most of the self-assessment tests exist in the form of multiple-choice questions (MCQs), we argue that the scores should also be robust to the order in which the options are presented.
arXiv Detail & Related papers (2023-09-15T05:19:39Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)