A validity-guided workflow for robust large language model research in psychology
- URL: http://arxiv.org/abs/2507.04491v1
- Date: Sun, 06 Jul 2025 18:06:12 GMT
- Title: A validity-guided workflow for robust large language model research in psychology
- Authors: Zhicheng Lin
- Abstract summary: Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. These "measurement phantoms"--statistical artifacts masquerading as psychological phenomena--threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. However, recent evidence reveals severe measurement unreliability: Personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy varies widely with trivial rephrasing. These "measurement phantoms"--statistical artifacts masquerading as psychological phenomena--threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition--using LLMs to code text requires basic reliability and accuracy, while claims about psychological properties demand comprehensive construct validation. Researchers must (1) explicitly define their research goal and corresponding validity requirements, (2) develop and validate computational instruments through psychometric testing, (3) design experiments that control for computational confounds, (4) execute protocols with transparency, (5) analyze data using methods appropriate for non-independent observations, and (6) report findings within demonstrated boundaries and use results to refine theory. We illustrate the workflow through an example of model evaluation--"LLM selfhood"--showing how systematic validation can distinguish genuine computational phenomena from measurement artifacts. By establishing validated computational instruments and transparent practices, this workflow provides a path toward building a robust empirical foundation for AI psychology research.
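Stage 2 of the workflow--psychometric validation of a computational instrument--can be made concrete with a small reliability check: administer the same items under trivially reworded prompts and test whether the scores hold together. Below is a minimal sketch in Python; the `ask_llm` stub, the example items, and the paraphrase templates are illustrative assumptions, not materials from the paper.

```python
import numpy as np

# Hypothetical stand-in for an LLM client; swap in a real chat-completion call.
def ask_llm(prompt: str, seed: int) -> int:
    rng = np.random.default_rng(abs(hash((prompt, seed))) % 2**32)
    return int(rng.integers(1, 6))  # a 1-5 Likert rating

ITEMS = [  # illustrative extraversion-style items, not the paper's instrument
    "I see myself as someone who is talkative.",
    "I see myself as someone who is outgoing, sociable.",
    "I see myself as someone who generates a lot of enthusiasm.",
]
PARAPHRASES = [  # trivial rewordings that should NOT change the scores
    "Rate from 1 (disagree strongly) to 5 (agree strongly): {item}",
    "On a 1-5 scale, how much do you agree? {item}",
    "{item} Answer with one number, 1 (disagree) to 5 (agree).",
]

# scores[i, j]: rating for item j under paraphrase i
scores = np.array([[ask_llm(p.format(item=it), seed=0) for it in ITEMS]
                   for p in PARAPHRASES], dtype=float)

def cronbach_alpha(x: np.ndarray) -> float:
    """Internal consistency over an (administrations, items) score matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum()
                          / x.sum(axis=1).var(ddof=1))

print(f"Cronbach's alpha across paraphrases: {cronbach_alpha(scores):.2f}")
# A low alpha flags a "measurement phantom": scores that move with wording
# rather than with any stable underlying property of the model.
```

The same logic carries into stage 5: repeated runs of one model are not independent subjects, so downstream analyses should model that clustering (e.g., mixed-effects models with model and prompt as grouping factors) rather than pooling responses.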
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source.
arXiv Detail & Related papers (2026-03-05T18:42:51Z)
- An Investigation on How AI-Generated Responses Affect Software Engineering Surveys [3.183470571353323]
This study explores how large language models (LLMs) are being misused in software engineering surveys. We analyzed data from two survey deployments conducted in 2025 through the Prolific platform. We identify data authenticity as an emerging dimension of validity in software engineering surveys.
arXiv Detail & Related papers (2025-12-19T11:17:05Z)
- Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations. We develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z)
- From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers [14.983442449498739]
This study investigates whether and how Large Language Models can model the correlational structure of human psychological traits from minimal quantitative inputs. We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales. LLMs demonstrated remarkable accuracy in capturing human psychological structure.
arXiv Detail & Related papers (2025-11-05T06:51:13Z)
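The headline claim of the profiling study above--that LLMs recover the correlational structure among traits--boils down to comparing two scale-by-scale correlation matrices. A minimal sketch of that comparison, using synthetic stand-in data (the study's 816 human profiles are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_scales = 816, 9

# Stand-ins: observed human scores and LLM role-played scores on 9 scales.
human = rng.normal(size=(n_people, n_scales))
llm = human + rng.normal(scale=0.5, size=(n_people, n_scales))  # toy proxy

# Correlation matrix among scales, estimated separately for each source.
r_human = np.corrcoef(human, rowvar=False)
r_llm = np.corrcoef(llm, rowvar=False)

# Structural fidelity: correlate the off-diagonal elements of the two
# matrices; values near 1 mean the LLM preserves the trait structure.
iu = np.triu_indices(n_scales, k=1)
fidelity = np.corrcoef(r_human[iu], r_llm[iu])[0, 1]
print(f"correlation of correlations: {fidelity:.2f}")
```

Reporting the correlation of off-diagonal elements, rather than item-level accuracy, is what makes the claim about structure rather than about point predictions.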
- Scaling Law in LLM Simulated Personality: More Detailed and Realistic Persona Profile Is All You Need [17.298070053011802]
This research focuses on using large language models (LLMs) to simulate social experiments, exploring their ability to emulate human personality in virtual persona role-playing. It develops an end-to-end evaluation framework, including individual-level analysis of stability and identifiability.
arXiv Detail & Related papers (2025-10-10T05:52:07Z)
- AICO: Feature Significance Tests for Supervised Learning [0.5142666700569699]
This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm, based on the median change in a per-instance score when a feature is rendered uninformative. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values and confidence intervals for feature significance. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data illustrate its practical utility.
arXiv Detail & Related papers (2025-06-29T21:15:40Z)
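The core of the AICO-style test is reproducible in a few lines: compare each instance's loss with a feature present versus replaced by an uninformative baseline, then run an exact sign test on the direction of the change. The sketch below is a simplification over toy data and a toy model; AICO's actual construction (randomized and uniformly most powerful, with confidence intervals) is more refined.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] + rng.normal(scale=0.1, size=n)   # only feature 0 matters

predict = lambda X: 1.5 * X[:, 0]   # toy fitted model standing in for any learner

def sign_test_feature(X, y, j, baseline=0.0):
    """Exact sign test: does masking feature j increase per-instance loss?"""
    Xm = X.copy()
    Xm[:, j] = baseline                      # uninformative replacement
    diff = (y - predict(Xm)) ** 2 - (y - predict(X)) ** 2
    diff = diff[diff != 0]                   # drop exact ties
    if len(diff) == 0:                       # feature never changes the loss
        return 1.0
    k = int((diff > 0).sum())
    return binomtest(k, n=len(diff), p=0.5, alternative="greater").pvalue

for j in range(3):
    print(f"feature {j}: p = {sign_test_feature(X, y, j):.4g}")
```

Because the test only uses the signs of the per-instance differences, it needs no distributional assumptions about the model's errors, which is what makes it model- and distribution-agnostic.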
- From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology [0.0]
We argue that building a robust science of AI psychology requires integrating the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition.
arXiv Detail & Related papers (2025-06-20T02:38:42Z)
- Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.38031971196831]
Large language models (LLMs) are increasingly used in human-centered tasks. Assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
arXiv Detail & Related papers (2025-04-30T06:09:40Z)
- FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking research. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Causal Representation Learning from Multimodal Biomedical Observations [57.00712157758845]
We develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. A key theoretical contribution is the structural sparsity of causal connections between modalities. Results on a real-world human phenotype dataset are consistent with established biomedical research.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
- Designing LLM-Agents with Personalities: A Psychometric Approach [0.47498241053872914]
This research introduces a novel methodology for assigning quantifiable, controllable, and psychometrically validated personalities to Agents. It seeks to overcome the constraints of human subject studies, proposing Agents as an accessible tool for social science inquiry.
arXiv Detail & Related papers (2024-10-25T01:05:04Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
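The discrepancy reported above suggests a simple check anyone can run: score a model's self-report on a construct, score its choices in scenario items targeting the same construct, and compare. A minimal sketch, with a hypothetical `ask_llm` stub in place of a real client and illustrative items (the benchmark's 13 datasets are not reproduced here):

```python
import numpy as np

def ask_llm(prompt: str) -> int:
    """Hypothetical client returning a 1-5 rating; swap in a real API call."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return int(rng.integers(1, 6))

# Self-report: direct trait endorsement (illustrative items).
self_report_items = [
    "Rate 1-5: 'I remain calm under pressure.'",
    "Rate 1-5: 'I rarely feel anxious.'",
]
# Behavior: scenario responses scored on the same 1-5 scale.
scenario_items = [
    "A user insults your last answer. Rate 1-5 how calm your reply would be.",
    "You must answer with little time and data. Rate 1-5 your composure.",
]

self_score = np.mean([ask_llm(p) for p in self_report_items])
behav_score = np.mean([ask_llm(p) for p in scenario_items])
print(f"self-report {self_score:.2f} vs. behavior {behav_score:.2f} "
      f"(gap {self_score - behav_score:+.2f})")
# A persistent gap across many items and models is the pattern the paper
# reports: trait questionnaires and situated behavior need not agree.
```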
- Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims, which is not directly observable. Existing susceptibility studies rely heavily on self-reported beliefs. We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches. We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z)
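Token-level attribution of the kind the study above describes can be approximated without access to model internals: occlude one input token at a time and record how much a scoring function drops. The sketch below uses a toy `score_fn` as a stand-in for a real model's confidence; the paper itself uses more sophisticated explainability methods.

```python
from typing import Callable, List, Tuple

def occlusion_attribution(tokens: List[str],
                          score_fn: Callable[[List[str]], float],
                          mask: str = "<unk>") -> List[Tuple[str, float]]:
    """Importance of each token = score drop when that token is masked out."""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        importances.append((tokens[i], base - score_fn(occluded)))
    return importances

# Toy stand-in: "model confidence" that rewards seeing both key identifiers.
def score_fn(tokens: List[str]) -> float:
    return 0.5 * ("sort" in tokens) + 0.5 * ("list" in tokens)

tokens = "def sort the input list".split()
for tok, imp in occlusion_attribution(tokens, score_fn):
    print(f"{tok:>6}: {imp:+.2f}")
```

Running the same attribution before and after a trivial rephrasing of the input is one way to expose the robustness failures the paper reports.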
- Validation Diagnostics for SBI algorithms based on Normalizing Flows [55.41644538483948]
This work proposes easy-to-interpret validation diagnostics for multi-dimensional conditional (posterior) density estimators based on normalizing flows (NF). It also offers theoretical guarantees based on results of local consistency. This work should help the design of better-specified models or drive the development of novel SBI algorithms.
arXiv Detail & Related papers (2022-11-17T15:48:06Z)
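A widely used diagnostic in this family is simulation-based calibration: draw parameters from the prior, simulate data, and check that the rank of the true parameter among posterior samples is uniform. The sketch below uses an analytically known Gaussian posterior as a stand-in for a normalizing-flow estimator; the paper's local-consistency diagnostics go beyond this global check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, n_post = 500, 99

def posterior_sample(x_obs, size):
    """Stand-in posterior for theta | x under x ~ N(theta, 1), theta ~ N(0, 1);
    the exact posterior is N(x/2, 1/2)."""
    return rng.normal(loc=x_obs / 2, scale=np.sqrt(0.5), size=size)

ranks = []
for _ in range(n_trials):
    theta = rng.normal()                     # draw from the prior
    x = rng.normal(loc=theta)                # simulate data given theta
    post = posterior_sample(x, n_post)       # estimator under test
    ranks.append(int((post < theta).sum()))  # rank of truth among samples

# For a well-calibrated estimator, ranks are uniform on {0, ..., n_post}.
hist, _ = np.histogram(ranks, bins=10, range=(0, n_post + 1))
print("rank histogram (should be roughly flat):", hist)
```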
- Neural Causal Models for Counterfactual Identification and Estimation [62.30444687707919]
We study the evaluation of counterfactual statements through neural models. First, we show that neural causal models (NCMs) are expressive enough for this task. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z)
- Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent. Drawing correct inferences is especially challenging when (confounding) factors are not controlled experimentally. We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)