Questioning the Survey Responses of Large Language Models
- URL: http://arxiv.org/abs/2306.07951v4
- Date: Mon, 09 Dec 2024 10:47:06 GMT
- Title: Questioning the Survey Responses of Large Language Models
- Authors: Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner,
- Abstract summary: We critically examine the methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau.
We establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A"
Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses.
- Score: 25.14481433176348
- License:
- Abstract: Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
Related papers
- Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations [49.908708778200115]
We are the first to specialize large language models (LLMs) for simulating survey response distributions.
As a testbed, we use country-level results from two global cultural surveys.
We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions.
arXiv Detail & Related papers (2025-02-10T21:59:27Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs)
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - Forcing Diffuse Distributions out of Language Models [70.28345569190388]
Despite being trained specifically to follow user instructions, today's instructiontuned language models perform poorly when instructed to produce random outputs.
We propose a fine-tuning method that encourages language models to output distributions that are diffuse over valid outcomes.
arXiv Detail & Related papers (2024-04-16T19:17:23Z) - Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models.
We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
arXiv Detail & Related papers (2024-03-14T15:58:36Z) - Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a
Large Language Model Based on Group-Level Demographic Information [15.435605802794408]
Large language models exhibit societal biases associated with demographic information.
We propose "random silicon sampling," a method to emulate the opinions of the human population sub-group.
We find that language models can generate response distributions remarkably similar to the actual U.S. public opinion polls.
arXiv Detail & Related papers (2024-02-28T08:09:14Z) - Exposing Bias in Online Communities through Large-Scale Language Models [3.04585143845864]
This work uses the flaw of bias in language models to explore the biases of six different online communities.
The bias of the resulting models is evaluated by prompting the models with different demographics and comparing the sentiment and toxicity values of these generations.
This work not only affirms how easily bias is absorbed from training data but also presents a scalable method to identify and compare the bias of different datasets or communities.
arXiv Detail & Related papers (2023-06-04T08:09:26Z) - This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language
Models [12.214260053244871]
We analyse the body of work that uses prompts and templates to assess bias in language models.
We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure.
Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched.
arXiv Detail & Related papers (2023-05-22T06:28:48Z) - Open vs Closed-ended questions in attitudinal surveys -- comparing,
combining, and interpreting using natural language processing [3.867363075280544]
Topic Modeling could significantly reduce the time to extract information from open-ended responses.
Our research uses Topic Modeling to extract information from open-ended questions and compare its performance with closed-ended responses.
arXiv Detail & Related papers (2022-05-03T06:01:03Z) - UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.