Diminished Diversity-of-Thought in a Standard Large Language Model
- URL: http://arxiv.org/abs/2302.07267v6
- Date: Wed, 13 Sep 2023 07:44:42 GMT
- Title: Diminished Diversity-of-Thought in a Standard Large Language Model
- Authors: Peter S. Park, Philipp Schoenegger, Chongyang Zhu
- Abstract summary: We run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model.
We find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results.
In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt.
- Score: 3.683202928838613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We test whether Large Language Models (LLMs) can be used to simulate human
participants in social-science studies. To do this, we run replications of 14
studies from the Many Labs 2 replication project with OpenAI's text-davinci-003
model, colloquially known as GPT3.5. Based on our pre-registered analyses, we
find that among the eight studies we could analyse, our GPT sample replicated
37.5% of the original results and 37.5% of the Many Labs 2 results. However, we
were unable to analyse the remaining six studies due to an unexpected
phenomenon we call the "correct answer" effect. Different runs of GPT3.5
answered nuanced questions probing political orientation, economic preference,
judgement, and moral philosophy with zero or near-zero variation in responses:
with the supposedly "correct answer." In one exploratory follow-up study, we
found that a "correct answer" was robust to changing the demographic details
that precede the prompt. In another, we found that most but not all "correct
answers" were robust to changing the order of answer choices. One of our most
striking findings occurred in our replication of the Moral Foundations Theory
survey results, where we found GPT3.5 identifying as a political conservative
in 99.6% of the cases, and as a liberal in 99.3% of the cases in the
reverse-order condition. However, both self-reported 'GPT conservatives' and
'GPT liberals' showed right-leaning moral foundations. Our results cast doubts
on the validity of using LLMs as a general replacement for human participants
in the social sciences. Our results also raise concerns that a hypothetical
AI-led future may be subject to a diminished diversity-of-thought.
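To make the "correct answer" probe concrete, here is a minimal sketch of the kind of check described in the first exploratory follow-up: the same survey item is prefixed with different demographic preambles, sampled repeatedly, and the spread of completions is inspected. This is not the authors' code; it assumes the openai Python client (v1 interface) with an API key in the environment, uses "gpt-3.5-turbo-instruct" as a stand-in for the now-retired text-davinci-003, and the personas and survey item are illustrative placeholders rather than Many Labs 2 materials.

```python
# Minimal sketch (assumptions, not the authors' code) of probing the
# "correct answer" effect: prepend different demographic preambles to the
# same survey item, sample repeatedly, and check whether completions vary.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative placeholder personas, not Many Labs 2 materials.
PERSONAS = [
    "I am a 22-year-old woman from rural Texas.",
    "I am a 58-year-old man from Berlin.",
    "I am a 35-year-old retiree from Tokyo.",
]
# Illustrative placeholder survey item.
ITEM = (
    "Is it morally acceptable to break a promise to a friend in order to "
    "help a stranger in need? Answer with 'Yes' or 'No'.\nAnswer:"
)

answers_by_persona = {}
for persona in PERSONAS:
    answers = []
    for _ in range(10):  # repeated runs per persona
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",  # stand-in for text-davinci-003
            prompt=f"{persona}\n{ITEM}",
            max_tokens=3,
            temperature=1.0,
        )
        answers.append(resp.choices[0].text.strip())
    answers_by_persona[persona] = Counter(answers)

# Zero or near-zero variation across personas and runs would mirror the
# "correct answer" effect reported above.
for persona, counts in answers_by_persona.items():
    print(persona, dict(counts))
```

Sampling at a nonzero temperature matters here: collapsing to a single answer despite stochastic decoding and varied personas is what distinguishes the reported effect from ordinary greedy determinism.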
Related papers
- Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPT's Political Biases [0.0]
This work investigates the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5 to GPT-4.
The Political Compass Test and the Big Five Personality Test were employed 100 times for each scenario.
The responses were analyzed by computing averages, standard deviations, and performing significance tests to investigate differences between GPT-3.5 and GPT-4.
Correlations were found for traits that have been shown to be interdependent in human studies.
arXiv Detail & Related papers (2024-10-28T13:32:52Z)
- Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion [45.84205238554709]
We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents.
We ask the LLM GPT-3.5 to predict each respondent's vote choice and compare these predictions to the survey-based estimates.
We find that GPT-3.5 does not predict citizens' vote choice accurately, exhibiting a bias towards the Green and Left parties.
arXiv Detail & Related papers (2024-07-11T14:52:18Z)
- Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis [0.27309692684728604]
We prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs.
We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties.
arXiv Detail & Related papers (2024-05-12T10:52:15Z)
- Large Language Models Show Human-like Social Desirability Biases in Survey Responses [12.767606361552684]
We show that Large Language Models (LLMs) skew their scores towards the desirable ends of trait dimensions when they infer that their personality is being evaluated.
This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2.
Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias.
arXiv Detail & Related papers (2024-05-09T19:02:53Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
- Behind the Screen: Investigating ChatGPT's Dark Personality Traits and Conspiracy Beliefs [0.0]
This paper analyzes the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4.
Dark personality traits and conspiracy beliefs were not particularly pronounced in either model.
arXiv Detail & Related papers (2024-02-06T16:03:57Z)
- What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in their latent representations.
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
- Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs [67.51906565969227]
We study the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks.
Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups.
arXiv Detail & Related papers (2023-11-08T18:52:17Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue (an order-swapping check in this spirit is sketched below).
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
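The position-bias finding in the last entry lends itself to a small illustration. Below is a minimal sketch, not the paper's released framework, of an order-swapping check in the spirit of its calibration strategies: each pair of candidate answers is judged twice with the candidates swapped, and a verdict is accepted only if it survives the swap. The `judge` callable is a hypothetical stand-in for whatever LLM-judging helper is in use.

```python
# Minimal sketch of a position-balanced verdict for an LLM judge.
# `judge` is a hypothetical callable that takes a prompt and returns the
# judge model's raw text reply ("1" or "2").
from typing import Callable, Optional


def balanced_judgement(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str], str],
) -> Optional[str]:
    """Return 'A', 'B', or None when the verdict flips with candidate order."""
    template = (
        "Question: {q}\n\n"
        "Response 1:\n{r1}\n\n"
        "Response 2:\n{r2}\n\n"
        "Which response is better? Reply with exactly '1' or '2'."
    )
    # Judge the pair twice, swapping which answer appears first.
    first = judge(template.format(q=question, r1=answer_a, r2=answer_b)).strip()
    second = judge(template.format(q=question, r1=answer_b, r2=answer_a)).strip()

    # Map position-based verdicts back to the underlying candidates.
    winner_first = "A" if first == "1" else "B"
    winner_second = "B" if second == "1" else "A"

    # Agreement across both orders gives a position-robust verdict; a flip
    # means the ranking was driven by order of appearance, the hack above.
    return winner_first if winner_first == winner_second else None
```

A return value of None flags exactly the failure mode described in that entry: the "better" answer changes when only its position in the context changes.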