ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models
- URL: http://arxiv.org/abs/2507.02919v1
- Date: Wed, 25 Jun 2025 12:35:44 GMT
- Title: ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models
- Authors: Dai Li, Linzhuo Li, Huilian Sophie Qiu
- Abstract summary: Large language models (LLMs) are proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions.
- Score: 4.066868402300836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy does not hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an "accuracy-optimization hypothesis," suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbot AIs, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.
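The paper's own metrics are not reproduced here, but the two failure modes can be illustrated with a small sketch: compare the response distribution of a simulated "silicon sample" with the human ANES distribution, and track how much probability mass falls outside the modal answer. The response codes, toy data, and the total-variation measure below are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch (not the paper's code): comparing a "silicon sample" of LLM
# responses against human ANES responses on an ordinal survey item, and
# quantifying homogenization as loss of minority-opinion mass.
# Response codes and the total-variation metric are illustrative assumptions.
from collections import Counter

def response_distribution(responses, categories):
    """Empirical share of each response category."""
    counts = Counter(responses)
    total = len(responses)
    return {c: counts.get(c, 0) / total for c in categories}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

def minority_share(dist):
    """Probability mass on all categories other than the modal one."""
    return 1.0 - max(dist.values())

# Hypothetical 4-point abortion-attitude item (1 = never permit ... 4 = always permit)
categories = [1, 2, 3, 4]
human = [1, 2, 2, 3, 4, 4, 4, 2, 3, 1]      # stand-in for ANES 2020 responses
silicon = [3, 3, 3, 3, 4, 3, 3, 3, 3, 3]    # stand-in for LLM "silicon sample"

p_human, p_llm = (response_distribution(r, categories) for r in (human, silicon))
print("divergence:", total_variation(p_human, p_llm))
print("human minority share:", minority_share(p_human))
print("LLM minority share:", minority_share(p_llm))   # much smaller => homogenization
```

The same comparison can be repeated at different demographic aggregation levels (e.g., nationally versus within age-by-party cells) to probe the structural-consistency question the abstract raises.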
Related papers
- Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings [7.284860523651357]
We assess the misalignment between Large Language Models (LLMs)-simulated and actual human behaviors in multiple-choice survey settings. We apply this framework to a popular language model for simulating people's opinions in various public surveys. This raises questions about the alignment of this language model with the tested populations.
arXiv Detail & Related papers (2025-06-17T22:04:55Z)
- Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions.
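The abstract only names the idea of a first-token-probability objective; a rough sketch of how such a predicted distribution could be read off and compared with survey data is below. The token ids, option labels, and toy logits are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch (not the paper's implementation) of the first-token idea:
# take the model's logits for the first generated token, restrict them to the
# tokens that encode the answer options, and compare the resulting distribution
# to the observed survey distribution with a KL-divergence objective.
import numpy as np

def option_distribution(first_token_logits, option_token_ids):
    """Softmax over the logits of the answer-option tokens only."""
    logits = first_token_logits[option_token_ids]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); the quantity a fine-tuning loss would minimize."""
    return float(np.sum(target * (np.log(target + eps) - np.log(predicted + eps))))

vocab_size = 50_000
first_token_logits = np.random.randn(vocab_size)      # stand-in for model output
option_token_ids = np.array([314, 1125, 2048, 4096])  # hypothetical ids for options "A".."D"
survey_dist = np.array([0.10, 0.25, 0.40, 0.25])      # observed response shares

predicted = option_distribution(first_token_logits, option_token_ids)
print("predicted option distribution:", predicted.round(3))
print("KL(survey || model):", kl_divergence(survey_dist, predicted))
```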
arXiv Detail & Related papers (2025-02-10T21:59:27Z)
- ChatGPT vs Social Surveys: Probing Objective and Subjective Silicon Population [7.281887764378982]
Large Language Models (LLMs) have the potential to simulate human responses in social surveys and generate reliable predictions. We employ repeated random sampling to create sampling distributions that identify the population parameters of silicon samples generated by GPT. Our findings show that GPT's demographic distribution aligns with the 2020 U.S. population in terms of gender and average age. GPT's point estimates for attitudinal scores are highly inconsistent and show no clear inclination toward any particular ideology.
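A minimal sketch of the repeated-random-sampling idea, assuming each simulated respondent comes from one model call: draw many independent batches, compute a point estimate per batch, and use the spread of those estimates as a sampling distribution. The sample sizes and the stand-in "model" below are illustrative, not the paper's setup.

```python
# Sketch of repeated random sampling for silicon samples (illustrative assumptions).
import random
import statistics

def draw_silicon_sample(n, query_llm):
    """One simulated sample of n respondents; query_llm stands in for a chat API call."""
    return [query_llm() for _ in range(n)]

def sampling_distribution(n_samples, n_respondents, query_llm, estimator=statistics.mean):
    """Point estimates from repeated independent silicon samples."""
    return [estimator(draw_silicon_sample(n_respondents, query_llm))
            for _ in range(n_samples)]

# Toy stand-in for an LLM returning an age for a simulated respondent.
fake_llm_age = lambda: random.gauss(mu=38.0, sigma=12.0)

estimates = sampling_distribution(n_samples=200, n_respondents=100, query_llm=fake_llm_age)
mean_est = statistics.mean(estimates)
se = statistics.stdev(estimates)
print(f"mean age estimate: {mean_est:.1f}, 95% interval: "
      f"({mean_est - 1.96 * se:.1f}, {mean_est + 1.96 * se:.1f})")
```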
arXiv Detail & Related papers (2024-09-04T10:33:37Z)
- Are Large Language Models Chameleons? An Attempt to Simulate Social Surveys [1.5727456947901746]
We conducted millions of simulations in which large language models (LLMs) were asked to answer subjective questions.
A comparison of different LLM responses with the European Social Survey (ESS) data suggests that the effect of prompts on bias and variability is fundamental.
arXiv Detail & Related papers (2024-05-29T17:54:22Z)
- Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
- You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner.
Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability.
Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
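A toy sketch of the kind of perturbation check described above, assuming answers can be compared as plain strings: rephrase or reorder the prompt, re-ask, and report the share of variants that keep the original answer. The specific perturbations and the stand-in model are assumptions, not the paper's protocol.

```python
# Illustrative prompt-perturbation consistency check (assumed names and perturbations).
from itertools import permutations

def perturbations(question, options):
    """Simple perturbations: reordered answer options and an uppercased question."""
    variants = [(question, list(order)) for order in permutations(options)]
    variants.append((question.upper(), list(options)))
    return variants

def consistency(ask_model, question, options):
    """Share of perturbed prompts whose answer matches the unperturbed one."""
    baseline = ask_model(question, list(options))
    variants = perturbations(question, options)
    same = sum(ask_model(q, opts) == baseline for q, opts in variants)
    return same / len(variants)

# Toy stand-in model that (unrealistically) always answers "agree".
always_agree = lambda q, opts: "agree"
print(consistency(always_agree, "I enjoy social gatherings.", ["agree", "disagree"]))
```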
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Demonstrations of the Potential of AI-based Political Issue Polling [0.0]
We develop a prompt engineering methodology for eliciting human-like survey responses from ChatGPT.
We execute large-scale experiments, querying for thousands of simulated responses at a cost far lower than human surveys.
We find ChatGPT is effective at anticipating both the mean level and distribution of public opinion on a variety of policy issues.
But it is less successful at anticipating demographic-level differences.
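One way to make the demographic-level comparison concrete is sketched below: compute subgroup means for human and simulated data and compare the between-group gaps. The subgroup labels and numbers are invented for illustration and do not come from the paper.

```python
# Sketch of a demographic-gap comparison between human and simulated polling data.
import statistics

def subgroup_means(rows, group_key, value_key):
    """Mean response per demographic subgroup."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {g: statistics.mean(vals) for g, vals in groups.items()}

# Invented example values, not survey results.
human = [{"age_group": "18-29", "support": 0.72}, {"age_group": "65+", "support": 0.41}]
silicon = [{"age_group": "18-29", "support": 0.66}, {"age_group": "65+", "support": 0.63}]

h = subgroup_means(human, "age_group", "support")
s = subgroup_means(silicon, "age_group", "support")
gap_error = abs((h["18-29"] - h["65+"]) - (s["18-29"] - s["65+"]))
print("error in the between-group gap:", round(gap_error, 2))  # large => poor demographic fidelity
```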
arXiv Detail & Related papers (2023-07-10T12:17:15Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Whose Opinions Do Language Models Reflect? [88.35520051971538]
We investigate the opinions reflected by language models (LMs) by leveraging high-quality public opinion polls and their associated human responses.
We find substantial misalignment between the views reflected by current LMs and those of US demographic groups.
Our analysis confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs.
arXiv Detail & Related papers (2023-03-30T17:17:08Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)