Large language models cannot replace human participants because they cannot portray identity groups
- URL: http://arxiv.org/abs/2402.01908v1
- Date: Fri, 2 Feb 2024 21:21:06 GMT
- Title: Large language models cannot replace human participants because they cannot portray identity groups
- Authors: Angelina Wang and Jamie Morgenstern and John P. Dickerson
- Abstract summary: We argue that large language models (LLMs) are doomed to both misportray and flatten the representations of demographic groups.
We discuss a third consideration about how identity prompts can essentialize identities.
Overall, we urge caution in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand.
- Score: 40.865099955752825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasing in capability and popularity,
propelling their application in new domains -- including as replacements for
human participants in computational social science, user testing, annotation
tasks, and more. Traditionally, in all of these settings survey distributors
are careful to find representative samples of the human population to ensure
the validity of their results and understand potential demographic differences.
This means in order to be a suitable replacement, LLMs will need to be able to
capture the influence of positionality (i.e., relevance of social identities
like gender and race). However, we show that there are two inherent limitations
in the way current LLMs are trained that prevent this. We argue analytically
for why LLMs are doomed to both misportray and flatten the representations of
demographic groups, then empirically show this to be true on 4 LLMs through a
series of human studies with 3200 participants across 16 demographic
identities. We also discuss a third consideration about how identity prompts
can essentialize identities. Throughout, we connect each of these limitations
to a pernicious history that shows why each is harmful for marginalized
demographic groups. Overall, we urge caution in use cases where LLMs are
intended to replace human participants whose identities are relevant to the
task at hand. At the same time, in cases where the goal is to supplement rather
than replace (e.g., pilot studies), we provide empirically better
inference-time techniques to reduce, but not remove, these harms.
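
To make the setup concrete, the following is a minimal sketch of identity-prompted survey simulation of the kind the abstract describes. It is illustrative only, not the authors' protocol: the `query_llm` helper, the identity list, and the survey item are hypothetical placeholders for whatever model, API, and instrument a study would actually use.

```python
# Minimal sketch of identity-prompted survey simulation (illustrative only;
# not the authors' protocol). `query_llm` is a hypothetical placeholder for
# whatever chat-completion API is actually used.
from collections import defaultdict

def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: call your LLM provider here and return its text reply."""
    raise NotImplementedError

IDENTITIES = ["a woman", "a man", "a Black person", "an Asian person"]  # illustrative subset
SURVEY_ITEM = "On a scale of 1-5, how offensive is the following comment? ..."

def simulate_responses(n_samples: int = 10) -> dict[str, list[int]]:
    """Collect identity-prompted ratings so their distributions can be
    compared against responses from real participants with those identities."""
    responses = defaultdict(list)
    for identity in IDENTITIES:
        for _ in range(n_samples):
            reply = query_llm(
                system_prompt=f"Answer as {identity} would. Reply with a single number.",
                user_prompt=SURVEY_ITEM,
            )
            digits = [c for c in reply if c.isdigit()]
            if digits:
                responses[identity].append(int(digits[0]))
    return responses
```

Comparing the distributions collected this way against responses from human participants who hold those identities is what exposes the two failure modes the paper documents: misportrayal (systematic divergence from in-group answers) and flattening (artificially low variance within a group).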
Related papers
- Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases [14.650234624251716]
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks.
These tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences.
We examine the role of prompting LLMs with human-like personas and ask the models to answer as if they were a specific human.
arXiv Detail & Related papers (2024-06-20T16:24:07Z)
- How should the advent of large language models affect the practice of science? [51.62881233954798]
How should the advent of large language models affect the practice of science?
We have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate.
arXiv Detail & Related papers (2023-12-05T10:45:12Z)
- Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks [15.015148115215315]
We conduct experiments on four popular large language models (LLMs) to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness.
We find that for both tasks, model predictions are closer to the labels from White and female participants.
More specifically, when prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both the overall scores and the scores from the corresponding groups.
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
- On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
- Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs [67.51906565969227]
We study the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks.
Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g., an Asian person) spanning 5 socio-demographic groups.
arXiv Detail & Related papers (2023-11-08T18:52:17Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Queer People are People First: Deconstructing Sexual Identity Stereotypes in Large Language Models [3.974379576408554]
Large Language Models (LLMs) are trained primarily on minimally processed web text.
LLMs can inadvertently perpetuate stereotypes towards marginalized groups, like the LGBTQIA+ community.
arXiv Detail & Related papers (2023-06-30T19:39:01Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [66.31055885857062]
This study aims to determine the reliability of applying personality assessments to Large Language Models (LLMs).
By shedding light on the personalization of LLMs, our study endeavors to pave the way for future explorations in this field.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models [33.157279170602784]
We present Marked Personas, a prompt-based method to measure stereotypes in large language models (LLMs).
We find that portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts.
An intersectional lens reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women.
arXiv Detail & Related papers (2023-05-29T16:29:22Z)
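
As a rough illustration of what a prompt-based stereotype probe like Marked Personas involves, the sketch below generates portrayals of a marked and an unmarked group and ranks the words that most distinguish them. The `query_llm` helper, the prompt wording, and the smoothed log-ratio scoring are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch in the spirit of prompt-based stereotype measurement
# (simplified; not the exact Marked Personas procedure). `query_llm`
# is a hypothetical placeholder for an LLM API call.
import math
import re
from collections import Counter

def query_llm(prompt: str) -> str:
    """Placeholder: call your LLM provider here and return its text reply."""
    raise NotImplementedError

def portrayals(group: str, n: int = 50) -> list[str]:
    """Generate n persona portrayals of the given group."""
    return [query_llm(f"Imagine you are {group}. Describe yourself.") for _ in range(n)]

def word_counts(texts: list[str]) -> Counter:
    return Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))

def distinguishing_words(marked: list[str], unmarked: list[str], top_k: int = 20):
    """Rank words by a smoothed log-ratio of relative frequencies: high scores
    are words that appear disproportionately in portrayals of the marked group."""
    a, b = word_counts(marked), word_counts(unmarked)
    total_a, total_b = sum(a.values()), sum(b.values())
    scores = {
        w: math.log((a[w] + 1) / (total_a + 1)) - math.log((b[w] + 1) / (total_b + 1))
        for w in set(a) | set(b)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, `distinguishing_words(portrayals("a Black woman"), portrayals("a white man"))` would surface the vocabulary that marks the first group's portrayals, which is the kind of signal the intersectional analysis described above draws on.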