Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
- URL: http://arxiv.org/abs/2506.14997v1
- Date: Tue, 17 Jun 2025 22:04:55 GMT
- Title: Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
- Authors: Harbin Hong, Sebastian Caldas, Liu Leqi
- Abstract summary: We assess the misalignment between Large Language Model (LLM)-simulated and actual human behaviors in multiple-choice survey settings. We apply a hypothesis-testing framework to a popular language model used to simulate people's opinions in various public surveys. The results raise questions about the alignment of this language model with the tested populations.
- Score: 7.284860523651357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
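The abstract does not spell out the concrete test statistic, but the core procedure (checking whether an LLM's simulated multiple-choice answers are distributed like the corresponding human survey answers) can be sketched with a standard goodness-of-fit test. The question, option labels, counts, and choice of a chi-squared test below are illustrative assumptions, not the paper's actual framework.

```python
# Minimal sketch: does an LLM's simulated answer distribution for one
# multiple-choice survey question match the human distribution?
# All names, counts, and the chi-squared test are illustrative assumptions.
import random
from collections import Counter
from scipy.stats import chisquare

options = ["A", "B", "C", "D"]                        # hypothetical answer options
human_counts = {"A": 120, "B": 45, "C": 25, "D": 10}  # hypothetical survey tallies

def simulate_llm_answers(n=200):
    """Stand-in for n repeated LLM queries; in practice, prompt the model with the
    survey question (and a demographic persona) and record the option it picks."""
    return random.choices(options, weights=[0.5, 0.3, 0.15, 0.05], k=n)

llm_counts = Counter(simulate_llm_answers(n=200))
n_llm = sum(llm_counts[o] for o in options)

# Null hypothesis: the LLM's answers are drawn from the human response distribution.
total_human = sum(human_counts.values())
expected = [human_counts[o] / total_human * n_llm for o in options]
observed = [llm_counts[o] for o in options]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p = {p_value:.4f}")
print("misaligned on this question" if p_value < 0.05 else "no evidence of misalignment")
```

In the paper's setting, a per-question test of this kind would presumably be repeated across questions and sub-populations (e.g., race, age, and income groups), with appropriate control for multiple comparisons.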
Related papers
- Mixture-of-Personas Language Models for Population Simulation [20.644911871150136]
Large Language Models (LLMs) can augment human-generated data in social science research and machine learning model training. MoP is a contextual mixture model, where each component is an LM agent characterized by a persona and an exemplar representing subpopulation behaviors. MoP is flexible, requires no model finetuning, and is transferable across base models.
arXiv Detail & Related papers (2025-04-07T12:43:05Z) - Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility [7.183662547358301]
We examine whether large language models process language similarly to humans. We find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation.
arXiv Detail & Related papers (2025-03-21T23:25:42Z) - Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions (a rough sketch of this first-token idea appears after this list).
arXiv Detail & Related papers (2025-02-10T21:59:27Z) - HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs).
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z) - Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations [0.0]
We investigate how personalized language models align with human responses on the Moral Foundation Theory Questionnaire. We adapt open-source generative language models to different political personas and repeatedly survey these models to generate synthetic data sets. Our analysis reveals that models produce inconsistent results across multiple repetitions, yielding high response variance.
arXiv Detail & Related papers (2024-08-21T08:20:41Z) - Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and extensible evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z) - Using LLMs to Model the Beliefs and Preferences of Targeted Populations [4.0849074543032105]
We consider the problem of aligning a large language model (LLM) to model the preferences of a human population.
Modeling the beliefs, preferences, and behaviors of a specific population can be useful for a variety of different applications.
arXiv Detail & Related papers (2024-03-29T15:58:46Z) - Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z) - A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive [53.08398658452411]
Large Language Models (LLMs) are increasingly utilized in autonomous decision-making. We show that this sampling behavior resembles that of human decision-making. We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains.
arXiv Detail & Related papers (2024-02-16T18:28:43Z) - Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z) - Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies [7.696359453385686]
We introduce a new type of test, called a Turing Experiment (TE).
A TE can reveal consistent distortions in a language model's simulation of a specific human behavior.
We compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments.
arXiv Detail & Related papers (2022-08-18T17:54:49Z)
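As referenced in the entry on "Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations" above, an answer distribution can be read off a model's first-token probabilities over the option labels. The sketch below illustrates that idea with an off-the-shelf Hugging Face model; the model name ("gpt2"), single-letter option tokens, target distribution, and KL objective are illustrative assumptions, not that paper's actual fine-tuning setup.

```python
# Sketch of the "first-token probability" idea: read the model's probability of
# each answer letter as the first generated token, then compare that predicted
# distribution to the actual survey distribution. Illustrative assumptions only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: ...\nOptions: A, B, C, D\nAnswer:"
option_token_ids = [tokenizer.encode(f" {o}", add_special_tokens=False)[0]
                    for o in ["A", "B", "C", "D"]]

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# Predicted distribution over the four options from the first-token logits.
pred = F.softmax(logits[option_token_ids], dim=-1)

# Hypothetical human survey distribution for the same question.
target = torch.tensor([0.60, 0.22, 0.13, 0.05])

# KL(target || pred): the quantity a fine-tuning loop could minimize
# (gradients would require dropping torch.no_grad() above).
kl = F.kl_div(pred.log(), target, reduction="sum")
print(f"KL divergence between human and first-token distributions: {kl.item():.4f}")
```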
This list is automatically generated from the titles and abstracts of the papers on this site.