Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
- URL: http://arxiv.org/abs/2409.00128v2
- Date: Wed, 4 Sep 2024 03:21:07 GMT
- Title: Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
- Authors: Ziyan Cui, Ning Li, Huaikang Zhou
- Abstract summary: GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies.
However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies.
Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings.
- Score: 1.5031024722977635
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
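As an illustration of the kind of per-study comparison the abstract describes (not the authors' actual pipeline), the sketch below shows one way to score a single replication: given the original effect size and its 95 percent confidence interval, plus effect estimates computed from LLM-simulated samples, it checks agreement in direction and significance and whether each confidence interval contains the other study's estimate. The helper name `replication_summary` and the numbers in the usage line are hypothetical.

```python
# Illustrative sketch only; assumes effect estimates from LLM-simulated samples
# have already been computed by some upstream analysis.
import numpy as np
from scipy import stats

def replication_summary(original_effect, original_ci, simulated_effects, alpha=0.05):
    """Compare an original effect size and its 95% CI against LLM-based replication estimates."""
    sim = np.asarray(simulated_effects, dtype=float)
    mean_effect = sim.mean()

    # 95% CI of the replicated effect via a t-interval over the simulated estimates
    sem = stats.sem(sim)
    half_width = stats.t.ppf(1 - alpha / 2, df=len(sim) - 1) * sem
    rep_ci = (mean_effect - half_width, mean_effect + half_width)

    # One-sample t-test against zero to judge significance of the replicated effect
    t_stat, p_value = stats.ttest_1samp(sim, popmean=0.0)

    return {
        "same_direction": bool(np.sign(mean_effect) == np.sign(original_effect)),
        "replication_significant": bool(p_value < alpha),
        "replicated_ci_contains_original": bool(rep_ci[0] <= original_effect <= rep_ci[1]),
        "original_ci_contains_replication": bool(original_ci[0] <= mean_effect <= original_ci[1]),
        "replicated_effect": mean_effect,
        "replicated_ci": rep_ci,
    }

# Hypothetical usage: an original study reports d = 0.40 with 95% CI [0.15, 0.65];
# the list stands in for effect sizes estimated from GPT-4 simulated participants.
print(replication_summary(0.40, (0.15, 0.65), [0.62, 0.71, 0.55, 0.68, 0.60]))
```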
Related papers
- Using AI to replicate human experimental results: a motion study [0.11838866556981258]
This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research.
It focuses on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs.
arXiv Detail & Related papers (2025-07-14T14:47:01Z)
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined.
We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery.
We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- Identifying Non-Replicable Social Science Studies with Language Models [2.621434923709917]
We evaluate the ability of open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings.
We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings.
arXiv Detail & Related papers (2025-03-10T11:48:05Z)
- Boosting Self-Efficacy and Performance of Large Language Models via Verbal Efficacy Stimulations [10.209999691197948]
This paper introduces Verbal Efficacy Stimulations (VES).
VES comprises three types of verbal prompts: encouraging, provocative, and critical, addressing six aspects such as helpfulness and competence.
The experimental results show that the three types of VES improve the performance of LLMs on most tasks, and the most effective VES varies for different models.
arXiv Detail & Related papers (2025-02-10T16:54:03Z)
- Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings [0.3749861135832072]
This report analyzes the potential for large language models (LLMs) to expedite accurate replication of message effects studies.
We tested LLM-powered participants by replicating 133 experimental findings from 14 papers containing 45 recent studies in the Journal of Marketing.
Our LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication of studies in which people respond to media stimuli.
arXiv Detail & Related papers (2024-08-28T18:14:39Z)
- Investigating Critical Period Effects in Language Acquisition through Neural Language Models [70.6367059367609]
Second language (L2) acquisition becomes harder after early childhood, and ceasing exposure to a first language (L1) after this period (but not before) typically does not lead to substantial loss of L1 proficiency.
It is unknown whether these critical period (CP) effects result from innately determined brain maturation or from a stabilization of neural connections naturally induced by experience.
arXiv Detail & Related papers (2024-07-27T19:17:10Z)
- The Efficacy of Conversational Artificial Intelligence in Rectifying the Theory of Mind and Autonomy Biases: Comparative Analysis [0.0]
The increasing deployment of Conversational Artificial Intelligence (CAI) in mental health interventions necessitates an evaluation of their efficacy in rectifying cognitive biases and recognizing affect in human-AI interactions.
This study aimed to assess the effectiveness of therapeutic chatbots versus general-purpose language models (GPT-3.5, GPT-4, Gemini Pro) in identifying and rectifying cognitive biases and recognizing affect in user interactions.
arXiv Detail & Related papers (2024-06-19T20:20:28Z)
- Are Large Language Models More Empathetic than Humans? [14.18033127602866]
GPT-4 emerged as the most empathetic, with approximately a 31% increase in responses rated as "Good" compared to the human benchmark.
Some LLMs are significantly better at responding to specific emotions compared to others.
arXiv Detail & Related papers (2024-06-07T16:33:43Z)
- Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z)
- Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4 [0.0]
This study compares human and GPT-4 problem-solving across both spatial and linguistic tasks.
Four experiments with 588 participants from the U.S. and 680 GPT-4 iterations revealed a stronger tendency towards additive transformations in GPT-4 than in humans.
arXiv Detail & Related papers (2024-04-25T15:53:00Z)
- Large language models surpass human experts in predicting neuroscience results [60.26891446026707]
Large language models (LLMs) forecast novel results better than human experts.
BrainBench is a benchmark for predicting neuroscience results.
Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.
arXiv Detail & Related papers (2024-03-04T15:27:59Z)
- Discovery of the Hidden World with Large Language Models [95.58823685009727]
This paper presents Causal representatiOn AssistanT (COAT), which introduces large language models (LLMs) to bridge the gap between unstructured data and the high-level variables needed for causal discovery.
LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data.
COAT also adopts causal discovery (CD) methods to find causal relations among the identified variables, as well as to provide feedback to LLMs to iteratively refine the proposed factors.
arXiv Detail & Related papers (2024-02-06T12:18:54Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Language Models Hallucinate, but May Excel at Fact Verification [89.0833981569957]
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs.
Even GPT-3.5 produces factual outputs less than 25% of the time.
This underscores the importance of fact verifiers in order to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
arXiv Detail & Related papers (2023-09-13T15:42:06Z)
- Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench [83.41621219298489]
We evaluate Large Language Models' (LLMs) anthropomorphic capabilities using the emotion appraisal theory from psychology.
We collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study.
We conduct a human evaluation involving more than 1,200 subjects worldwide.
arXiv Detail & Related papers (2023-08-07T15:18:30Z)
- Large Language Models Understand and Can be Enhanced by Emotional Stimuli [53.53886609012119]
We take the first step towards exploring the ability of Large Language Models to understand emotional stimuli.
Our experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts.
Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks.
arXiv Detail & Related papers (2023-07-14T00:57:12Z)
- Exploring Qualitative Research Using LLMs [8.545798128849091]
This study aimed to compare and contrast the comprehension capabilities of humans and AI-driven large language models.
We conducted an experiment with a small sample of Alexa app reviews, initially classified by a human analyst.
LLMs were then asked to classify these reviews and provide the reasoning behind each classification.
arXiv Detail & Related papers (2023-06-23T05:21:36Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Do Large Language Models Show Decision Heuristics Similar to Humans? A Case Study Using GPT-3.5 [0.0]
GPT-3.5 is an example of an LLM that supports a conversational agent called ChatGPT.
In this work, we used a series of novel prompts to determine whether ChatGPT shows biases and other decision effects.
We also tested the same prompts on human participants.
arXiv Detail & Related papers (2023-05-08T01:02:52Z)
- Susceptibility to Influence of Large Language Models [5.931099001882958]
Two studies tested the hypothesis that a Large Language Model (LLM) can be used to model psychological change following exposure to influential input.
The first study tested a generic mode of influence, the Illusory Truth Effect (ITE).
The second study tested a specific mode of influence: populist framing of news to increase its persuasive force and political mobilization.
arXiv Detail & Related papers (2023-03-10T16:53:30Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)