Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
- URL: http://arxiv.org/abs/2311.09730v2
- Date: Mon, 17 Feb 2025 17:46:03 GMT
- Title: Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
- Authors: Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens
- Abstract summary: Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts.
We evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness.
We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants.
- Score: 13.744746481528711
- Abstract: Human judgments are inherently subjective and are actively affected by personal traits such as gender and ethnicity. While Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts, their ability to account for demographic differences in subjective tasks remains uncertain. In this study, leveraging the POPQUORN dataset, we evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants, while only a minor gender bias favoring women appears in the politeness task. Furthermore, sociodemographic prompting does not consistently improve and, in some cases, worsens LLMs' ability to perceive language from specific sub-populations. These findings highlight potential demographic biases in LLMs when performing subjective judgment tasks and underscore the limitations of sociodemographic prompting as a strategy to achieve pluralistic alignment. Code and data are available at: https://github.com/Jiaxin-Pei/LLM-as-Subjective-Judge.
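The abstract describes an evaluation protocol without giving details here, so the following is only a minimal sketch of the zero-shot alignment check it implies: rate each text with the model once, then compare that rating against each demographic group's mean annotator label. The dataset fields, the 1-5 rating scale, and the `rate_fn` model call are illustrative assumptions, not the authors' released code (linked above).

```python
# Hypothetical sketch of the zero-shot alignment check described in the abstract;
# dataset fields, the 1-5 rating scale, and rate_fn are illustrative assumptions.
from collections import defaultdict
from statistics import mean

def group_alignment(instances, rate_fn):
    """Compare one model rating per instance against each demographic group's
    mean annotator label and return the mean absolute error per group.

    instances: iterable of dicts such as
        {"text": "...", "labels": {"White": [4, 5], "Black": [3], "Asian": [4]}}
    rate_fn:   callable mapping a text to a numeric rating (e.g., an LLM query).
    """
    errors = defaultdict(list)
    for item in instances:
        pred = rate_fn(item["text"])                  # zero-shot model rating
        for group, labels in item["labels"].items():
            if labels:                                # skip groups with no annotators
                errors[group].append(abs(pred - mean(labels)))
    return {group: mean(errs) for group, errs in errors.items()}

# Toy usage with a dummy model that always answers 4.0:
example = [{"text": "Could you send that again?",
            "labels": {"White": [4, 5], "Black": [3], "Asian": [4]}}]
print(group_alignment(example, lambda text: 4.0))
# -> {'White': 0.5, 'Black': 1.0, 'Asian': 0.0}
```

A consistently lower error for one group would indicate the differential alignment the abstract reports, e.g., closer agreement with labels from White participants than with labels from Asian or Black participants.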
Related papers
- Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts [5.111540255111445]
Race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%.
Retrieval models demonstrate comparable sensitivity to non-demographic changes, suggesting that fairness issues may stem from general brittleness issues.
arXiv Detail & Related papers (2025-01-08T07:28:10Z)
- Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness [10.194622474615462]
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors.
Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors.
arXiv Detail & Related papers (2024-11-13T19:08:23Z)
- Are Large Language Models Ready for Travel Planning? [6.307444995285539]
While large language models (LLMs) show promise in hospitality and tourism, their ability to provide unbiased service across demographic groups remains unclear.
This paper explores gender and ethnic biases when LLMs are utilized as travel planning assistants.
arXiv Detail & Related papers (2024-10-22T18:08:25Z)
- Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLMs' sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
- GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases.
GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z)
- GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
- Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [87.30837365008931]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks.
This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
arXiv Detail & Related papers (2024-01-28T06:50:10Z)
- On the steerability of large language models toward data-driven personas [98.9138902560793]
Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented.
Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs.
arXiv Detail & Related papers (2023-11-08T19:01:13Z)
- Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models [0.0]
This paper investigates bias along less-studied but still consequential dimensions, such as age and beauty.
We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the "what is beautiful is good" bias found in people in experimental psychology.
arXiv Detail & Related papers (2023-09-16T07:07:04Z)
- Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks; a minimal prompt-construction sketch appears after this list.
arXiv Detail & Related papers (2023-09-13T15:42:06Z)
- The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations [5.898806397015801]
We propose a simple method for analyzing and comparing demographic bias in Large Language Models (LLMs).
We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA.
We identify distinct biases in both models toward various demographic identities; for example, both models consistently suggest low-paying jobs for Mexican workers.
arXiv Detail & Related papers (2023-08-03T21:12:54Z)
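The Sensitivity, Performance, Robustness entry above defines sociodemographic prompting as steering a prompt-based model toward the answers a person with a specific profile would give. As a companion to that definition, here is a minimal sketch of how such a prompt might be assembled; the `build_prompt` helper, the persona wording, the attribute names, and the politeness scale are all assumptions for illustration and are not drawn from any of the listed papers.

```python
# Illustrative sociodemographic prompt template; the persona wording, attribute
# names, and 1-5 politeness scale are assumptions made for this sketch only.
def build_prompt(text, persona=None):
    """Return a rating prompt, optionally steered toward a sociodemographic persona."""
    instruction = ("Rate how polite the following message is on a scale from 1 "
                   "(not polite at all) to 5 (very polite). Reply with one number.\n")
    if persona:  # sociodemographic prompting: prepend a persona description
        traits = ", ".join(f"{key}: {value}" for key, value in persona.items())
        instruction = (f"Imagine you are a person with these attributes: {traits}. "
                       + instruction)
    return instruction + f"Message: {text}"

zero_shot = build_prompt("Could you possibly resend that file?")
steered = build_prompt("Could you possibly resend that file?",
                       persona={"gender": "woman", "race": "Black", "age": "30-39"})
```

The steered variant only prepends a persona description to an otherwise identical instruction; the main paper above finds that this kind of added context does not consistently improve, and sometimes worsens, alignment with the targeted sub-population.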