Identifying Non-Replicable Social Science Studies with Language Models
- URL: http://arxiv.org/abs/2503.10671v1
- Date: Mon, 10 Mar 2025 11:48:05 GMT
- Title: Identifying Non-Replicable Social Science Studies with Language Models
- Authors: Denitsa Saynova, Kajsa Hansson, Bastiaan Bruinsma, Annika Fredén, Moa Johansson
- Abstract summary: We evaluate the ability of open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings. We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings.
- Score: 2.621434923709917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we investigate whether LLMs can be used to indicate if a study in the behavioural social sciences is replicable. Using a dataset of 14 previously replicated studies (9 successful, 5 unsuccessful), we evaluate the ability of both open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings. We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings. When compared with human replication results for these studies, we achieve F1 values of up to $77\%$ with Mistral 7B, $67\%$ with GPT-4o and Llama 3 8B, and $55\%$ with Qwen 2 7B, suggesting their potential for this task. We also analyse how effect size calculations are affected by sampling temperature and find that low variance (due to temperature) leads to biased effect estimates.
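To make the pipeline described above concrete, here is a minimal sketch (not the authors' code) of how LLM-generated synthetic responses could be turned into a replicability call and scored against the human replication outcomes with F1. The helper `sample_llm_responses`, the decision threshold, and the use of Cohen's d as the effect size are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions, not the authors' code): turn LLM-generated
# synthetic responses into a replicability call and score the calls against
# human replication outcomes with F1.

import numpy as np
from sklearn.metrics import f1_score


def sample_llm_responses(prompt: str, n: int, temperature: float) -> np.ndarray:
    """Hypothetical helper: query an instruction-tuned LLM (e.g. Mistral 7B)
    n times at the given temperature and parse each completion into a
    numeric response on the study's response scale."""
    raise NotImplementedError("replace with an actual model call")


def cohens_d(treatment: np.ndarray, control: np.ndarray) -> float:
    """Standardised mean difference between the two synthetic samples."""
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treatment.mean() - control.mean()) / pooled_sd


def predict_replicable(study: dict, n: int = 100, temperature: float = 1.0,
                       threshold: float = 0.2) -> bool:
    """Call a study 'replicable' if the simulated effect points in the same
    direction as the original and clears a small-effect threshold (both the
    threshold and Cohen's d are illustrative assumptions)."""
    treatment = sample_llm_responses(study["treatment_prompt"], n, temperature)
    control = sample_llm_responses(study["control_prompt"], n, temperature)
    d = cohens_d(treatment, control)
    return bool(np.sign(d) == np.sign(study["original_d"]) and abs(d) >= threshold)


# Scoring against the human replication labels (1 = replicated successfully):
# predictions = [predict_replicable(s) for s in studies]
# print(f1_score(human_labels, predictions))
```

Because the pooled standard deviation sits in the denominator, sampling at a very low temperature shrinks the variance of the synthetic responses and can inflate the estimated effect, which is consistent with the abstract's finding that low variance due to temperature leads to biased effect estimates.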
Related papers
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined.
We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery.
We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z) - Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science [0.18416014644193066]
Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers.
We evaluate the performance of LLMs for systematic literature reviews.
arXiv Detail & Related papers (2025-03-16T05:52:18Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Hypothesis-only Biases in Large Language Model-Elicited Natural Language Inference [3.0804372027733202]
We recreate a portion of the Stanford NLI corpus using GPT-4, Llama-2 and Mistral 7b.
We train hypothesis-only classifiers to determine whether LLM-elicited hypotheses contain annotation artifacts.
Our analysis provides empirical evidence that well-attested biases in NLI can persist in LLM-generated data.
arXiv Detail & Related papers (2024-10-11T17:09:22Z) - Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation [51.127054971591924]
We introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples.
We demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average.
arXiv Detail & Related papers (2024-10-03T17:47:29Z) - Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs [1.5031024722977635]
GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies.
GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies.
Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings.
arXiv Detail & Related papers (2024-08-29T05:18:50Z) - Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study [0.28318468414401093]
This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Overall, results indicated an accuracy of around 80%, with some variability between domains.
arXiv Detail & Related papers (2024-05-23T11:24:23Z) - Exploring Value Biases: How LLMs Deviate Towards the Ideal [57.99044181599786]
Large-Language-Models (LLMs) are deployed in a wide range of applications, and their response has an increasing social impact.
We show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
arXiv Detail & Related papers (2024-02-16T18:28:43Z) - The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective [0.27624021966289597]
Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments. When LLM-simulated subjects are blind to the experimental design, variations in treatment systematically affect unspecified variables.
arXiv Detail & Related papers (2023-12-24T16:32:35Z) - Are You Sure? Challenging LLMs Leads to Performance Drops in The
FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs).
We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect).
We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate the effect (reducing performance deterioration by 60%) but does not resolve sycophantic behavior entirely.
arXiv Detail & Related papers (2023-11-14T23:40:22Z) - Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment [52.592199835286394]
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts.
Large language models (LLMs) have exhibited remarkable in-context learning capabilities.
arXiv Detail & Related papers (2023-10-16T17:11:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.