Questioning the Survey Responses of Large Language Models
- URL: http://arxiv.org/abs/2306.07951v3
- Date: Wed, 28 Feb 2024 12:37:53 GMT
- Title: Questioning the Survey Responses of Large Language Models
- Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner
- Abstract summary: We critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau.
We find that models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases.
Our findings suggest caution in treating models' survey responses as equivalent to those of human populations.
- Score: 18.61486375469644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models increase in capability, researchers have started to
conduct surveys of all kinds on these models in order to investigate the
population represented by their responses. In this work, we critically examine
language models' survey responses on the basis of the well-established American
Community Survey by the U.S. Census Bureau and investigate whether they elicit
a faithful representation of any human population. Using the de facto standard
multiple-choice prompting technique and systematically evaluating 39 different
language models, we establish two dominant patterns: First,
models' responses are governed by ordering and labeling biases, leading to
variations across models that do not persist after adjusting for systematic
biases. Second, models' responses do not contain the entropy variations and
statistical signals typically found in human populations. As a result, a binary
classifier can almost perfectly differentiate model-generated data from the
responses of the U.S. census. At the same time, models' relative alignment with
different demographic subgroups can be predicted from the subgroups' entropy,
irrespective of the model's training data or training strategy. Taken together,
our findings suggest caution in treating models' survey responses as equivalent
to those of human populations.
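To make the evaluation protocol concrete, the following sketch (not the authors' code) shows how multiple-choice prompting, adjustment for ordering and labeling biases, and the entropy comparison could be set up. Here `query_option_probs` is a hypothetical placeholder for whatever call returns a model's probability over the answer labels, the ACS-style question wording is illustrative, and averaging over answer orderings is just one simple way to adjust for the systematic biases the paper describes.

```python
import itertools
import math
import random
from collections import defaultdict

# Hypothetical stand-in for a language-model call that returns a probability
# for each presented answer option (e.g. read off first-token probabilities).
def query_option_probs(question: str, labeled_options: list[tuple[str, str]]) -> dict[str, float]:
    # A real experiment would prompt the model with the question plus the
    # lettered options; here we just return a random normalized distribution.
    raw = [random.random() for _ in labeled_options]
    total = sum(raw)
    return {option: p / total for (_, option), p in zip(labeled_options, raw)}

def response_distribution(question: str, options: list[str], labels: str = "ABCD") -> dict[str, float]:
    """Average the model's answer distribution over all orderings of the
    options, so ordering and labeling biases cancel out on average."""
    accumulated = defaultdict(float)
    permutations = list(itertools.permutations(options))
    for ordering in permutations:
        labeled = list(zip(labels, ordering))
        probs = query_option_probs(question, labeled)
        for option, p in probs.items():
            accumulated[option] += p / len(permutations)
    return dict(accumulated)

def entropy(distribution: dict[str, float]) -> float:
    """Shannon entropy in bits; the paper compares such entropies between
    model responses and U.S. census responses."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

if __name__ == "__main__":
    question = "What is this person's marital status?"  # ACS-style item, illustrative wording
    options = ["Now married", "Widowed", "Divorced", "Never married"]
    dist = response_distribution(question, options)
    print("bias-adjusted response distribution:", dist)
    print("response entropy (bits):", round(entropy(dist), 3))
```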
Related papers
- Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions.
As a testbed, we use country-level results from two global cultural surveys.
We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions (a toy version of this objective is sketched after this list).
arXiv Detail & Related papers (2025-02-10T21:59:27Z)
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- Forcing Diffuse Distributions out of Language Models [70.28345569190388]
Despite being trained specifically to follow user instructions, today's instruction-tuned language models perform poorly when instructed to produce random outputs.
We propose a fine-tuning method that encourages language models to output distributions that are diffuse over valid outcomes.
arXiv Detail & Related papers (2024-04-16T19:17:23Z)
- Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models.
We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
arXiv Detail & Related papers (2024-03-14T15:58:36Z)
- Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information [15.435605802794408]
Large language models exhibit societal biases associated with demographic information.
We propose "random silicon sampling," a method to emulate the opinions of the human population sub-group.
We find that language models can generate response distributions remarkably similar to the actual U.S. public opinion polls.
arXiv Detail & Related papers (2024-02-28T08:09:14Z)
- Exposing Bias in Online Communities through Large-Scale Language Models [3.04585143845864]
This work leverages the bias that language models absorb from their training data to explore the biases of six different online communities.
The bias of the resulting models is evaluated by prompting the models with different demographics and comparing the sentiment and toxicity values of these generations.
This work not only affirms how easily bias is absorbed from training data but also presents a scalable method to identify and compare the bias of different datasets or communities.
arXiv Detail & Related papers (2023-06-04T08:09:26Z)
- This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models [12.214260053244871]
We analyse the body of work that uses prompts and templates to assess bias in language models.
We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure.
Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched.
arXiv Detail & Related papers (2023-05-22T06:28:48Z)
- Open vs Closed-ended questions in attitudinal surveys -- comparing, combining, and interpreting using natural language processing [3.867363075280544]
Topic Modeling could significantly reduce the time to extract information from open-ended responses.
Our research uses Topic Modeling to extract information from open-ended questions and compare its performance with closed-ended responses.
arXiv Detail & Related papers (2022-05-03T06:01:03Z)
- UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
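As referenced in the first related paper above (Specializing Large Language Models to Simulate Survey Response Distributions), the fine-tuning idea amounts to minimizing a divergence between the model's first-token distribution over answer labels and the observed survey response distribution. The sketch below is a hedged illustration, not that paper's implementation: the logits and survey shares are placeholder numbers, and plain KL divergence stands in for whatever divergence the authors actually minimize.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert first-token logits for the answer labels into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(target: list[float], predicted: list[float]) -> float:
    """KL(target || predicted): the quantity a fine-tuning loop would drive toward zero."""
    return sum(t * math.log(t / q) for t, q in zip(target, predicted) if t > 0)

# Placeholder numbers for a four-option survey question with labels "A".."D".
label_logits = [2.1, 0.3, -0.5, 1.2]            # model's first-token logits (illustrative)
survey_distribution = [0.45, 0.20, 0.10, 0.25]  # observed response shares (illustrative)

model_distribution = softmax(label_logits)
print("model distribution:", [round(p, 3) for p in model_distribution])
print("KL(survey || model):", round(kl_divergence(survey_distribution, model_distribution), 4))
```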