Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?
- URL: http://arxiv.org/abs/2511.21218v1
- Date: Wed, 26 Nov 2025 09:50:42 GMT
- Title: Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?
- Authors: Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph
- Abstract summary: Large language models (LLMs) have been proposed as substitutes for human participants in survey and experimental research, but they often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes.
- Score: 9.310571879281186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.
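The paper's exact estimands and model specifications are not reproduced here, but the comparison dimensions the abstract lists can be illustrated with a minimal sketch. The dataframes `human_df` and `llm_df`, the outcome column `disclosure`, the covariate `belief`, and the subgroup column `group` below are hypothetical placeholders, not the study's actual variables.

```python
# Hedged sketch of the three kinds of comparison the abstract describes:
# distributional divergence, subgroup alignment, and regression-coefficient
# recovery. All dataframe and column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import wasserstein_distance
import statsmodels.formula.api as smf

def distributional_divergence(human_df, llm_df, outcome="disclosure"):
    """Wasserstein distance between human and simulated outcome distributions."""
    return wasserstein_distance(human_df[outcome], llm_df[outcome])

def subgroup_alignment(human_df, llm_df, outcome="disclosure", group="group"):
    """Absolute gap between human and simulated subgroup means (smaller = better aligned)."""
    human_means = human_df.groupby(group)[outcome].mean()
    sim_means = llm_df.groupby(group)[outcome].mean()
    return (human_means - sim_means).abs()

def coefficient_recovery(human_df, llm_df, formula="disclosure ~ belief + C(group)"):
    """Fit the same regression to each dataset and compare the estimated coefficients."""
    beta_human = smf.ols(formula, data=human_df).fit().params
    beta_sim = smf.ols(formula, data=llm_df).fit().params
    return pd.DataFrame({"human": beta_human, "simulated": beta_sim})
```

A simulated dataset can score well on the first two checks while still failing the third, which is the pattern the abstract reports: fine-tuning improves heterogeneity and alignment, yet the regression coefficients of the original study are not recovered.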
Related papers
- This human study did not involve human subjects: Validating LLM simulations as behavioral evidence [15.56427716190418]
Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable. Statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses.
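A minimal sketch of one simple calibration scheme of this kind, an additive per-subgroup correction estimated from a small human pilot sample, is shown below; the paper's actual statistical adjustments may be more elaborate, and the dataframe and column names are hypothetical.

```python
# Hedged sketch: shift LLM-simulated responses so that subgroup means match a
# small human pilot sample. One simple instance of statistical calibration, not
# necessarily the paper's method; `pilot_df`, `sim_df`, and the column names
# are hypothetical placeholders.
import pandas as pd

def calibrate(sim_df: pd.DataFrame, pilot_df: pd.DataFrame,
              outcome: str = "response", group: str = "group") -> pd.DataFrame:
    offsets = (
        pilot_df.groupby(group)[outcome].mean()
        - sim_df.groupby(group)[outcome].mean()
    )
    calibrated = sim_df.copy()
    # Subgroups absent from the pilot receive no correction.
    calibrated[outcome] = calibrated[outcome] + calibrated[group].map(offsets).fillna(0.0)
    return calibrated
```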
arXiv Detail & Related papers (2026-02-17T18:18:38Z) - Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents [0.4277616907160855]
We use a large dataset of U.S. microdata to assess the impact of persona-conditioned simulations. We find that persona prompting does not yield a clear aggregate improvement in survey alignment and, in many cases, significantly degrades performance. Our findings highlight a key adverse impact of current persona-based simulation practices.
arXiv Detail & Related papers (2026-02-06T15:13:59Z) - Overstating Attitudes, Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility [7.616305266104683]
Large language models (LLMs) are increasingly used as proxies for human judgment in computational social science. We test whether LLM-simulated survey respondents can reproduce human patterns of misinformation belief and sharing.
arXiv Detail & Related papers (2026-02-04T15:48:05Z) - Large language models replicate and predict human cooperation across experiments in game theory [0.8166364251367626]
How closely large language models mirror actual human decision-making remains poorly understood. We develop a digital twin of game-theoretic experiments and introduce a systematic prompting and probing framework for machine-behavioral evaluation. We find that Llama reproduces human cooperation patterns with high fidelity, capturing human deviations from rational choice theory.
arXiv Detail & Related papers (2025-11-06T16:21:27Z) - LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions [60.48458130500911]
We investigate whether emergent misalignment can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios. We finetune open-source LLMs on misaligned completions across diverse domains. We find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%.
arXiv Detail & Related papers (2025-10-09T13:35:19Z) - Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management [11.302500716500893]
LLMs are emerging tools for simulating human behavior in business, economics, and social science. This paper evaluates how well LLMs replicate human behavior in operations management.
arXiv Detail & Related papers (2025-09-30T20:20:58Z) - Can Generative AI agents behave like humans? Evidence from laboratory market experiments [0.0]
We explore the potential of Large Language Models to replicate human behavior in economic market experiments. We compare LLM behavior to market dynamics observed in laboratory settings and assess their alignment with human participants' behavior. These results suggest that LLMs hold promise as tools for simulating realistic human behavior in economic contexts.
arXiv Detail & Related papers (2025-05-12T11:44:46Z) - Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction [5.774786149181393]
We analyze how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs). We find that LLM-generated data fails to replicate the variance observed in real-world human responses. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data.
arXiv Detail & Related papers (2025-02-22T16:25:33Z) - Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z) - Do LLMs exhibit human-like response biases? A case study in survey
design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Reweighted Mixup for Subpopulation Shift [63.1315456651771]
Subpopulation shift, which arises in many real-world applications, refers to settings where the training and test distributions contain the same subpopulation groups but in different proportions.
Importance reweighting is a classical and effective way to handle the subpopulation shift.
We propose a simple yet practical framework, called reweighted mixup, to mitigate the overfitting issue.
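A minimal sketch of how importance reweighting and mixup can be combined is shown below; this is a generic illustration under the standard formulations of both techniques, not necessarily the exact reweighted-mixup objective proposed in the paper.

```python
# Hedged sketch: importance-weighted mixup for subpopulation shift.
# Per-example weights are inversely proportional to subgroup frequency, and the
# mixup loss combines the weighted losses of both mixed endpoints.
import torch
import torch.nn.functional as F

def group_weights(group_ids: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Importance weights inversely proportional to subgroup frequency."""
    counts = torch.bincount(group_ids, minlength=num_groups).float()
    weights = (counts.sum() / counts.clamp(min=1.0))[group_ids]
    return weights / weights.mean()

def reweighted_mixup_loss(model, x, y, group_ids, num_groups, alpha=0.5):
    """Mix inputs with a Beta-sampled coefficient and reweight both endpoint losses."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    weights = group_weights(group_ids, num_groups)
    logits = model(x_mixed)
    loss = lam * weights * F.cross_entropy(logits, y, reduction="none") \
        + (1 - lam) * weights[perm] * F.cross_entropy(logits, y[perm], reduction="none")
    return loss.mean()
```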
arXiv Detail & Related papers (2023-04-09T03:44:50Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)