The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
- URL: http://arxiv.org/abs/2509.18052v1
- Date: Mon, 22 Sep 2025 17:27:29 GMT
- Title: The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
- Authors: Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, Maarten Sap,
- Abstract summary: We find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation.
- Score: 46.27915760967977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work, violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about "AI societies."
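The Unawareness finding above (models inferring the hidden hypothesis in 53.1% of cases) suggests a simple audit that any simulation study could run before deployment: show a model the instructions its agents would receive and ask it to guess what is being tested. The sketch below is a hypothetical illustration of such a probe, not the authors' evaluation harness; `query_llm` is an assumed stand-in for any chat-model client, and the instruction/hypothesis pairs are toy data.

```python
# Minimal sketch of an Unawareness probe (illustrative, not the paper's code).

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client (OpenAI, local model, etc.)."""
    raise NotImplementedError("plug in your model client here")

PROBE = (
    "You are an agent receiving the experiment instructions below.\n"
    "Instructions: {instructions}\n"
    "What social-science hypothesis do you think this experiment is testing? "
    "Answer in one sentence."
)

def unawareness_rate(cases: list[dict]) -> float:
    """Fraction of cases where the model names the hidden hypothesis.

    Each case has 'instructions' (what agents see) and 'hypothesis'
    (a keyword for what the experimenters actually test, e.g. 'conformity').
    """
    hits = 0
    for case in cases:
        guess = query_llm(PROBE.format(instructions=case["instructions"]))
        if case["hypothesis"].lower() in guess.lower():  # crude keyword match
            hits += 1
    return hits / len(cases)

# Toy usage: a conformity-style setup an agent might easily see through.
cases = [
    {"instructions": "Report your answer after seeing the group's answers.",
     "hypothesis": "conformity"},
]
# print(unawareness_rate(cases))
```

A high rate on such a probe would indicate that agents can condition their behavior on the experiment itself, which is exactly what the Unawareness principle rules out.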
Related papers
- The Qualitative Laboratory: Theory Prototyping and Hypothesis Generation with Large Language Models [0.0]
We argue that for this specific task, persona simulation offers a distinct advantage over established methods. By generating naturalistic discourse, it overcomes the lack of discursive depth common in vignette surveys. We present a protocol where personas derived from a sociological theory of climate reception react to policy messages.
arXiv Detail & Related papers (2025-11-25T08:31:48Z)
- Leveraging LLM-based agents for social science research: insights from citation network simulations [132.4334196445918]
We introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation. CiteAgent captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. We establish two LLM-based research paradigms in social science, allowing us to validate and challenge existing theories. (A minimal sketch of how a power-law claim like this can be checked appears after this list.)
arXiv Detail & Related papers (2025-11-05T08:47:04Z)
- The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems [13.628908663240564]
We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms. We examine norm evolution across a $2\times2$ grid of environmental and social initialisations. Our results reveal systematic model differences in sustaining cooperation and norm formation.
arXiv Detail & Related papers (2025-10-16T07:59:31Z)
- Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models [48.815314312823006]
This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. We assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context.
arXiv Detail & Related papers (2025-10-15T10:48:31Z)
- Population-Aligned Persona Generation for LLM-based Social Simulation [58.8436379542149]
We propose a systematic framework for synthesizing high-quality, population-aligned persona sets for social simulation. Our approach begins by leveraging large language models to generate narrative personas from long-term social media data. To address the needs of specific simulation contexts, we introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations.
arXiv Detail & Related papers (2025-09-12T10:43:47Z)
- Simulating Generative Social Agents via Theory-Informed Workflow Design [11.992123170134185]
We propose a theory-informed framework that provides a systematic design process for social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. Experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions.
arXiv Detail & Related papers (2025-08-12T08:14:48Z)
- LLM-Based Social Simulations Require a Boundary [3.351170542925928]
This position paper argues that large language model (LLM)-based social simulations should establish clear boundaries. We examine three key boundary problems: alignment (simulated behaviors matching real-world patterns), consistency (maintaining coherent agent behavior over time), and robustness.
arXiv Detail & Related papers (2025-06-24T17:14:47Z)
- Modeling Earth-Scale Human-Like Societies with One Billion Agents [54.465233996410156]
Light Society is an agent-based simulation framework. It formalizes social processes as structured transitions of agent and environment states. It supports efficient simulation of societies with over one billion agents.
arXiv Detail & Related papers (2025-06-07T09:14:12Z)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [136.27567671480156]
We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. We frame experiment-guided ranking as a sequential decision-making problem. Our approach significantly outperforms pre-experiment baselines and strong ablations.
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
- GenSim: A General Social Simulation Platform with Large Language Model based Agents [111.00666003559324]
We propose a novel large language model (LLM)-based simulation platform called GenSim. Our platform supports one hundred thousand agents to better simulate large-scale populations in real-world contexts. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform.
arXiv Detail & Related papers (2024-10-06T05:02:23Z)
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [71.04084063541777]
Counterfactual learning to rank has attracted extensive attention in the IR community. Models can be theoretically unbiased when the user behavior assumption is correct and the propensity estimation is accurate. Their effectiveness is usually empirically evaluated via simulation-based experiments due to a lack of widely available, large-scale, real click logs.
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- LLM-driven Imitation of Subrational Behavior: Illusion or Reality? [3.2365468114603937]
Existing work highlights the ability of Large Language Models to address complex reasoning tasks and mimic human communication.
We propose to investigate the use of LLMs to generate synthetic human demonstrations, which are then used to learn subrational agent policies.
We experimentally evaluate the ability of our framework to model sub-rationality through four simple scenarios.
arXiv Detail & Related papers (2024-02-13T19:46:39Z)
- How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)
- Do LLM Agents Exhibit Social Behavior? [5.094340963261968]
State-Understanding-Value-Action (SUVA) is a framework to systematically analyze responses in social contexts.
It assesses social behavior through both agents' final decisions and the response generation processes leading to those decisions.
We demonstrate that utterance-based reasoning reliably predicts LLMs' final actions.
arXiv Detail & Related papers (2023-12-23T08:46:53Z)
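As referenced in the CiteAgent entry above, structural claims such as a power-law citation distribution are typically checked by fitting an exponent to the heavy tail of the degree distribution. Below is a minimal, self-contained sketch using the standard continuous maximum-likelihood estimator; the preferential-attachment toy data and all names are illustrative assumptions, not CiteAgent's actual validation code.

```python
# Minimal sketch: is a citation-count distribution roughly power-law?
# Illustrative only; not the CiteAgent framework's code.

import math
import random

def powerlaw_alpha(counts: list[int], x_min: int = 1) -> float:
    """Continuous MLE exponent: alpha = 1 + n / sum(ln(x_i / x_min)), x_i >= x_min."""
    tail = [c for c in counts if c >= x_min]
    return 1.0 + len(tail) / sum(math.log(c / x_min) for c in tail)

# Toy data: preferential attachment produces a heavy-tailed distribution,
# the mechanism usually invoked for real citation networks.
random.seed(0)
citations = [1] * 10
for _ in range(5000):
    # Each new paper cites an existing one with probability
    # proportional to that paper's current citation count.
    target = random.choices(range(len(citations)), weights=citations)[0]
    citations[target] += 1
    citations.append(1)

print(f"estimated alpha: {powerlaw_alpha(citations, x_min=5):.2f}")
# Empirical citation networks typically report alpha in the range 2-3.
```

A simulated society that claims to reproduce such a structural regularity should pass a check of this kind against real network data, which is what the Realism principle in the main paper asks for.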