Exploring Social Desirability Response Bias in Large Language Models: Evidence from GPT-4 Simulations
- URL: http://arxiv.org/abs/2410.15442v1
- Date: Sun, 20 Oct 2024 16:28:24 GMT
- Title: Exploring Social Desirability Response Bias in Large Language Models: Evidence from GPT-4 Simulations
- Authors: Sanguk Lee, Kai-Qi Yang, Tai-Quan Peng, Ruth Heo, Hui Liu
- Abstract summary: Large language models (LLMs) are employed to simulate human-like responses in social surveys.
It remains unclear if they develop biases like social desirability response (SDR) bias.
The study underscores potential avenues for using LLMs to investigate biases in both humans and LLMs themselves.
- Score: 4.172974580485295
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are employed to simulate human-like responses in social surveys, yet it remains unclear if they develop biases like social desirability response (SDR) bias. To investigate this, GPT-4 was assigned personas from four societies, using data from the 2022 Gallup World Poll. These synthetic samples were then prompted with or without a commitment statement intended to induce SDR. The results were mixed. While the commitment statement increased SDR index scores, suggesting SDR bias, it reduced civic engagement scores, indicating an opposite trend. Additional findings revealed demographic associations with SDR scores and showed that the commitment statement had limited impact on GPT-4's predictive performance. The study underscores potential avenues for using LLMs to investigate biases in both humans and LLMs themselves.
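To make the setup concrete, below is a minimal sketch of persona-conditioned survey prompting with and without a commitment statement. This is not the authors' implementation: the persona fields, question wording, commitment text, and model name are placeholders assumed for illustration, and it relies on the OpenAI Python client.

```python
# Illustrative sketch of persona-conditioned survey prompting with and
# without a commitment statement. Not the authors' code; persona fields,
# question text, and commitment wording are invented placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

persona = {"country": "United States", "age": 45, "gender": "female"}  # hypothetical respondent
question = ("In the past month, have you volunteered your time to an organization? "
            "Answer with 'Yes' or 'No'.")
commitment = ("Before answering, I commit to answering the following questions "
              "honestly, even if my honest answers are unflattering.")

def ask(with_commitment: bool) -> str:
    system = (f"You are a {persona['age']}-year-old {persona['gender']} "
              f"living in {persona['country']}. Answer survey questions as this person.")
    user = (commitment + "\n\n" + question) if with_commitment else question
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=1.0,
    )
    return resp.choices[0].message.content

baseline = ask(with_commitment=False)
treated = ask(with_commitment=True)
# Comparing answer distributions across many synthetic personas, with versus
# without the commitment statement, is one rough way to operationalize SDR.
```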
Related papers
- Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare [7.075750841525739]
Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs).
This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents.
Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours.
arXiv Detail & Related papers (2025-04-11T05:11:40Z)
- How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs.
Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z)
- ChatGPT vs Social Surveys: Probing the Objective and Subjective Human Society [7.281887764378982]
We used ChatGPT-3.5 to simulate the sampling process and generated six socioeconomic characteristics from the 2020 US population.
We analyzed responses to questions about income inequality and gender roles to explore GPT's subjective attitudes.
Our findings show some alignment in gender and age means with the actual 2020 US population, but we also found mismatches in the distributions of racial and educational groups.
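As an illustration of how such distributional mismatches can be checked, the sketch below runs a chi-square goodness-of-fit test of a simulated demographic distribution against reference population proportions; the categories, counts, and proportions are invented for the example and are not taken from the paper.

```python
# Illustrative check of whether simulated demographics match a reference
# population; the counts and proportions below are made up for the example.
import numpy as np
from scipy.stats import chisquare

# Counts of one attribute (e.g., educational attainment) in the LLM-generated sample.
simulated_counts = np.array([120, 260, 340, 280])

# Reference proportions for the same categories in the target population.
census_props = np.array([0.10, 0.28, 0.35, 0.27])

expected = census_props * simulated_counts.sum()
stat, p_value = chisquare(f_obs=simulated_counts, f_exp=expected)
print(f"chi2={stat:.2f}, p={p_value:.4f}")  # a small p-value suggests a distributional mismatch
```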
arXiv Detail & Related papers (2024-09-04T10:33:37Z)
- Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion [45.84205238554709]
We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents.
We ask the LLM GPT-3.5 to predict each respondent's vote choice and compare these predictions to the survey-based estimates.
We find that GPT-3.5 does not predict citizens' vote choice accurately, exhibiting a bias towards the Green and Left parties.
arXiv Detail & Related papers (2024-07-11T14:52:18Z)
- Large Language Models Show Human-like Social Desirability Biases in Survey Responses [12.767606361552684]
We show that Large Language Models (LLMs) skew their scores towards the desirable ends of trait dimensions when they infer that their personality is being evaluated.
This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2.
Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias.
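For reference, reverse-coding simply mirrors an item's score across the scale midpoint so that agreement no longer points toward the desirable pole; a minimal sketch with an invented 5-point item:

```python
# Minimal sketch of reverse-coding a Likert item on a 1-5 scale.
# The item wording and responses are invented for illustration.
SCALE_MIN, SCALE_MAX = 1, 5

def reverse_code(score: int) -> int:
    """Mirror a score across the scale midpoint (1<->5, 2<->4, 3 stays 3)."""
    return SCALE_MIN + SCALE_MAX - score

responses = [5, 4, 2, 5]                                   # e.g., ratings of "I am always honest."
reversed_responses = [reverse_code(s) for s in responses]  # [1, 2, 4, 1]
# If scores remain skewed toward the desirable pole even after reverse-coding,
# the skew is unlikely to be pure acquiescence ("yea-saying") bias.
```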
arXiv Detail & Related papers (2024-05-09T19:02:53Z)
- Evaluating LLMs for Gender Disparities in Notable Persons [0.40964539027092906]
This study examines the use of Large Language Models (LLMs) for retrieving factual information.
It addresses concerns over their propensity to produce factually incorrect "hallucinated" responses or to decline to answer prompts altogether.
arXiv Detail & Related papers (2024-03-14T07:58:27Z)
- Exploring Value Biases: How LLMs Deviate Towards the Ideal [57.99044181599786]
Large Language Models (LLMs) are deployed in a wide range of applications, and their responses have an increasing social impact.
We show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
arXiv Detail & Related papers (2024-02-16T18:28:43Z)
- Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs [67.51906565969227]
We study the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks.
Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups.
arXiv Detail & Related papers (2023-11-08T18:52:17Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
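A simplified, non-learned approximation of the LSS idea treats it as the longest in-order subsequence of claim tokens that also appears in the context, computable with standard longest-common-subsequence dynamic programming; the sketch below is only an illustrative stand-in for the paper's finetuned metric.

```python
# Simplified stand-in for the LSS metric: length of the longest in-order
# subsequence of claim tokens also present in the context, found via
# standard longest-common-subsequence dynamic programming.
def lss_fraction(claim: str, context: str) -> float:
    a, b = claim.lower().split(), context.lower().split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)] / max(len(a), 1)  # fraction of the claim that is supported

print(lss_fraction("the cat sat on the mat", "a cat quietly sat down on a soft mat"))  # ~0.67
```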
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks [75.58692290694452]
We compare social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye.
We observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models.
arXiv Detail & Related papers (2022-10-18T17:58:39Z)