Related papers: SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

URL: http://arxiv.org/abs/2510.17516v3
Date: Mon, 27 Oct 2025 14:17:13 GMT
Title: SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Authors: Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
Abstract summary: We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. We show that even the best LLMs today have limited simulation ability (score: 40.80/100), but that performance scales log-linearly with model size. We demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
Score: 58.87134689752605
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
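The abstract distinguishes low-entropy (consensus) questions from high-entropy (diverse) ones. As a rough illustration of that distinction, the sketch below computes the normalized Shannon entropy of a response distribution. It is an assumption-laden example, not the paper's scoring method: the function name `normalized_entropy` and the response counts are hypothetical, and the benchmark's actual metric is not described in this record.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a response distribution, scaled to [0, 1].

    `counts` holds the number of respondents choosing each answer option.
    0 means everyone picked the same option; 1 means a uniform spread.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    # Skip empty options: p * log(p) tends to 0 as p tends to 0.
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Divide by the maximum possible entropy for this number of options.
    return entropy / math.log2(len(counts))


# Hypothetical response counts, made up for illustration only.
consensus_question = [95, 3, 2]    # most respondents agree
diverse_question = [34, 33, 33]    # opinions are split

print(normalized_entropy(consensus_question))
print(normalized_entropy(diverse_question))
```

Under this reading, a question like the first would count as low-entropy and a question like the second as high-entropy.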

Related papers

PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies [88.78188489161028]
We introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS). PolaRiS is a scalable real-to-sim framework for high-fidelity simulated robot evaluation. We show that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks.
arXiv Detail & Related papers (2025-12-18T18:49:41Z)
Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation [18.225151370273093]
This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs). We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS). We curate a comprehensive benchmark suite, LLM-S3 (Large Language Model-based Sociodemographic Simulation Survey), that spans 11 real-world public datasets across four sociological domains.
arXiv Detail & Related papers (2025-09-08T04:59:00Z)
YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models [50.35333054932747]
We introduce a novel social simulator called YuLan-OneSim. Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication.
arXiv Detail & Related papers (2025-05-12T14:05:17Z)
Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following [12.145213376813155]
Large Language Models (LLMs) are increasingly widely used to simulate personas in virtual environments. We show that even state-of-the-art LLMs cannot simulate personas with reversed performance.
arXiv Detail & Related papers (2025-04-08T22:00:32Z)
GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects [55.02281855589641]
GausSim is a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that represents a continuous piece of matter. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations.
arXiv Detail & Related papers (2024-12-23T18:58:17Z)
Sense and Sensitivity: Evaluating the simulation of social dynamics via Large Language Models [27.313165173789233]
Large language models have been proposed as a powerful replacement for classical agent-based models (ABMs) to simulate social dynamics. However, due to the black box nature of LLMs, it is unclear whether LLM agents actually execute the intended semantics. We show that while it is possible to engineer prompts that approximate the intended dynamics, the quality of these simulations is highly sensitive to the particular choice of prompts.
arXiv Detail & Related papers (2024-12-06T14:50:01Z)
GenSim: A General Social Simulation Platform with Large Language Model based Agents [111.00666003559324]
We propose a novel large language model (LLM)-based simulation platform called GenSim. Our platform supports one hundred thousand agents to better simulate large-scale populations in real-world contexts. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform.
arXiv Detail & Related papers (2024-10-06T05:02:23Z)
BeSimulator: A Large Language Model Powered Text-based Behavior Simulator [18.318419980796012]
We propose BeSimulator as an attempt towards behavior simulation in the context of text-based environments. BeSimulator can generalize across scenarios and achieve long-horizon complex simulation. Our experiments show a significant performance improvement in behavior simulation compared to baselines.
arXiv Detail & Related papers (2024-09-24T08:37:04Z)
User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors. Based on extensive experiments, we find that the simulated behaviors of our method are very close to those of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.