Related papers: From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology

URL: http://arxiv.org/abs/2506.16697v1
Date: Fri, 20 Jun 2025 02:38:42 GMT
Title: From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology
Authors: Zhicheng Lin
Abstract summary: We argue that building a robust science of AI psychology requires integrating the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.

Related papers

A validity-guided workflow for robust large language model research in psychology [0.0]
Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. These "measurement phantoms"--statistical artifacts masquerading as psychological phenomena--threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition.
arXiv Detail & Related papers (2025-07-06T18:06:12Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Designing LLM-Agents with Personalities: A Psychometric Approach [0.47498241053872914]
This research introduces a novel methodology for assigning quantifiable, controllable and psychometrically validated personalities to Agents. It seeks to overcome the constraints of human subject studies, proposing Agents as an accessible tool for social science inquiry.
arXiv Detail & Related papers (2024-10-25T01:05:04Z)
Assessment and manipulation of latent constructs in pre-trained language models using psychometric scales [4.805861461250903]
We show how standard psychological questionnaires can be reformulated into natural language inference prompts. We demonstrate, using a sample of 88 publicly available models, the existence of human-like mental health-related constructs.
arXiv Detail & Related papers (2024-09-29T11:00:41Z)
Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale [2.50194939587674]
This dissertation addresses quantifying and mitigating sources of arbitrariness in ML and randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability. The dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately bound up with research in law and policy.
arXiv Detail & Related papers (2024-06-13T19:29:37Z)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks [52.61917615039112]
We use CausalGym to benchmark the ability of interpretability methods to causally affect model behaviour. We study the pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods. We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena.
arXiv Detail & Related papers (2024-02-19T21:35:56Z)
Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims that is not observable. Existing susceptibility studies heavily rely on self-reported beliefs. We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors. Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)
Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent. Processing correct inferences is especially challenging when (confounding) factors are not controlled experimentally. We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)
AGENT: A Benchmark for Core Psychological Reasoning [60.35621718321559]
Intuitive psychology is the ability to reason about hidden mental variables that drive observable actions. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. We present a benchmark consisting of procedurally generated 3D animations, AGENT, structured around four scenarios.
arXiv Detail & Related papers (2021-02-24T14:58:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.