HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
- URL: http://arxiv.org/abs/2602.00685v1
- Date: Sat, 31 Jan 2026 12:07:42 GMT
- Title: HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
- Authors: Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin,
- Abstract summary: Large language models (LLMs) are increasingly used as simulated participants in social science experiments. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct human-subject experiments. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter--Extract--Execute--Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.
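The abstract proposes metrics that evaluate fidelity "at the level of scientific inference" rather than raw response matching. As a minimal illustrative sketch (not the paper's actual metrics, and all function names are hypothetical), one such check asks whether the human and agent datasets support the same effect direction, and how far their standardized effect sizes diverge:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(treat, ctrl):
    """Standardized effect size (Cohen's d) with a pooled standard deviation."""
    n1, n2 = len(treat), len(ctrl)
    pooled = sqrt(((n1 - 1) * stdev(treat) ** 2 + (n2 - 1) * stdev(ctrl) ** 2)
                  / (n1 + n2 - 2))
    return (mean(treat) - mean(ctrl)) / pooled

def inference_agreement(human_treat, human_ctrl, agent_treat, agent_ctrl):
    """Compare the scientific inference drawn from human vs. agent data:
    do the estimated effects point the same way, and how far apart are they?"""
    d_h = cohens_d(human_treat, human_ctrl)
    d_a = cohens_d(agent_treat, agent_ctrl)
    return {"same_direction": (d_h > 0) == (d_a > 0),
            "effect_size_gap": abs(d_h - d_a)}
```

In the benchmark's framing, such a check would run after both the original human data and the agent replay have passed through the same statistical pipeline, so agreement is measured on the analysis outputs rather than on individual responses.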
Related papers
- Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data
Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. We propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer.
arXiv Detail & Related papers (2026-03-01T21:46:27Z)
- Large language models replicate and predict human cooperation across experiments in game theory
How closely large language models mirror actual human decision-making remains poorly understood. We develop a digital twin of game-theoretic experiments and introduce a systematic prompting and probing framework for machine-behavioral evaluation. We find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory.
arXiv Detail & Related papers (2025-11-06T16:21:27Z)
- EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Large Language Models (LLMs) offer new opportunities to automate complex interdisciplinary research. EpidemIQs is a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, simulation invocation, data visualization and analysis, and finally documentation of findings in a structured manuscript. We evaluate EpidemIQs across different scenarios, measuring computational cost, completion success rate, and AI and human expert reviews of generated reports.
arXiv Detail & Related papers (2025-09-24T18:54:56Z)
- YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models
We introduce a novel social simulator called YuLan-OneSim. Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication.
arXiv Detail & Related papers (2025-05-12T14:05:17Z)
- Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark
Generalist Virtual Agents (GVAs) have shown significant promise in autonomous task execution. To address these challenges, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model. Similar offers fine-grained signals for agent training and can choose better actions for inference-time scaling.
arXiv Detail & Related papers (2025-03-24T13:30:47Z)
- Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration
This work aims to bridge the gap by investigating the problem of data synthesis through multi-agent sampling. We introduce Tree Search-based Orchestrated Agents (TOA), where the workflow evolves iteratively during the sequential sampling process. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales.
arXiv Detail & Related papers (2024-12-22T15:16:44Z)
- LLMs Can Simulate Standardized Patients via Agent Coevolution
Training medical personnel using standardized patients (SPs) remains a complex challenge. EvoPatient is a novel simulated-patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues. Our framework improves over existing reasoning methods by more than 10% in requirement alignment and achieves better human preference.
arXiv Detail & Related papers (2024-12-16T12:36:47Z)
- Generative Agent Simulations of 1,000 People
We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals.
The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers.
Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions.
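The "85% as accurately" figure above is a normalized accuracy: agent-human agreement divided by human test-retest agreement, so that 1.0 means the agent matches a participant's answers as often as the participant matches their own answers on retest. A minimal sketch of that normalization (the component rates below are illustrative, not the paper's data):

```python
def normalized_accuracy(agent_human_match_rate, human_retest_rate):
    """Express agent accuracy relative to human self-consistency.
    A value of 1.0 means the agent matches a participant's survey answers
    as often as the participant matches their own answers on retest."""
    return agent_human_match_rate / human_retest_rate

# Illustrative: agents match 68% of answers, humans self-replicate 80% of them.
score = normalized_accuracy(0.68, 0.80)  # ~0.85
```

Normalizing by retest consistency matters because humans themselves answer inconsistently, so raw agreement understates how well an agent captures a stable individual.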
arXiv Detail & Related papers (2024-11-15T11:14:34Z)
- PersLLM: A Personified Training Approach for Large Language Models
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study
Counterfactual learning to rank has attracted extensive attention in the IR community. Models can be theoretically unbiased when the user behavior assumption is correct and the propensity estimation is accurate. Their effectiveness is usually evaluated empirically via simulation-based experiments due to a lack of widely available, large-scale, real click logs.
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- User Behavior Simulation with Large Language Model based Agents
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors.
Based on extensive experiments, we find that the simulated behaviors of our method closely match those of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.