Related papers: What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

URL: http://arxiv.org/abs/2311.18812v1
Date: Thu, 30 Nov 2023 18:53:13 GMT
Title: What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations
Authors: Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
Abstract summary: Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
Score: 62.91799637259657
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to "speak," we study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors. We first validate our probe on three pair preference tasks and thirteen LLMs, where we outperform the word embedding association test (WEAT), a standard approach in testing for implicit association, by a relative 27% in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
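As a rough illustration of the probe idea described in the abstract, a logistic Bradley-Terry model can be written as a single weight vector w, with the probability that item a is preferred over item b given by sigmoid(w · (h_a − h_b)). The sketch below is not the authors' implementation (their codebase is linked above). It assumes synthetic random vectors in place of real LLM hidden states, and the names `fit_bt_probe`, `preference`, and `hidden_direction` are hypothetical, introduced only for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_bt_probe(pairs, labels, lr=0.5, steps=300):
    """Fit w so that sigmoid(w . (h_a - h_b)) approximates P(a preferred over b).

    pairs  : list of (h_a, h_b) vector pairs (stand-ins for hidden states)
    labels : 1.0 if a was preferred, 0.0 if b was preferred
    Uses plain batch gradient descent on the logistic loss.
    """
    dim = len(pairs[0][0])
    diffs = [[a - b for a, b in zip(h_a, h_b)] for h_a, h_b in pairs]
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for d, y in zip(diffs, labels):
            err = sigmoid(dot(w, d)) - y
            for k in range(dim):
                grad[k] += err * d[k]
        w = [wk - lr * gk / len(diffs) for wk, gk in zip(w, grad)]
    return w


def preference(w, h_a, h_b):
    """Predicted probability that a is preferred over b."""
    return sigmoid(dot(w, [a - b for a, b in zip(h_a, h_b)]))


# Synthetic data (assumption): preferences follow a hidden linear direction.
random.seed(0)
DIM, N = 8, 200
hidden_direction = [random.gauss(0, 1) for _ in range(DIM)]
pairs = [
    ([random.gauss(0, 1) for _ in range(DIM)],
     [random.gauss(0, 1) for _ in range(DIM)])
    for _ in range(N)
]
labels = [
    1.0 if dot(hidden_direction, [a - b for a, b in zip(h_a, h_b)]) > 0 else 0.0
    for h_a, h_b in pairs
]

w = fit_bt_probe(pairs, labels)
accuracy = sum(
    (preference(w, h_a, h_b) > 0.5) == (y == 1.0)
    for (h_a, h_b), y in zip(pairs, labels)
) / N
print(f"training accuracy: {accuracy:.2f}")
```

Because the model depends only on the difference of the two vectors, swapping the pair flips the prediction: preference(w, a, b) + preference(w, b, a) equals 1. In a setting like the one the abstract describes, the vectors would come from a chosen layer of the LLM rather than from a random generator.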

Related papers

Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models [49.41113560646115]
We investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle.
arXiv Detail & Related papers (2025-06-12T08:47:40Z)
Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs [15.783346695504344]
We present the first study of multilingual intersecting country and gender biases. We construct a benchmark of prompts in English, Spanish and German, using 25 countries and four pronoun sets. We find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist.
arXiv Detail & Related papers (2025-05-05T08:40:51Z)
Implicit Bias in LLMs: A Survey [2.07180164747172]
This paper provides a comprehensive review of the existing literature on implicit bias in large language models. We begin by introducing key concepts, theories and methods related to implicit bias in psychology. We categorize detection methods into three primary approaches: word association, task-oriented text generation and decision-making.
arXiv Detail & Related papers (2025-03-04T16:49:37Z)
Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries [85.909363478929]
In this study, we focus on 19 real-world statistics collected from authoritative sources. We develop a checklist comprising objective and subjective queries to analyze behavior of large language models. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects.
arXiv Detail & Related papers (2025-02-09T10:54:11Z)
Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings [1.5379084885764847]
Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI). To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research.
arXiv Detail & Related papers (2024-11-25T16:14:45Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to a dialect in canonical reasoning tasks. We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K. We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization [0.0]
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language.
arXiv Detail & Related papers (2024-07-18T22:32:20Z)
White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency. We introduce the novel Language Agency Bias Evaluation benchmark. We unveil language agency social biases in content generated by 3 recent large language models (LLMs).
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between social attributes implied in user prompts and short responses. We develop analogous RUTEd evaluations from three contexts of real-world use. We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
Disclosure and Mitigation of Gender Bias in LLMs [64.79319733514266]
Large Language Models (LLMs) can generate biased responses. We propose an indirect probing framework based on conditional generation. We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs.
arXiv Detail & Related papers (2024-02-17T04:48:55Z)
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? [53.98071556805525]
Neural language models (LMs) can be used to evaluate the truth of factual statements. They can be queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
arXiv Detail & Related papers (2023-11-27T18:59:14Z)
Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks [15.015148115215315]
We conduct experiments on four popular large language models (LLMs) to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness. We find that for both tasks, model predictions are closer to the labels from White and female participants. More specifically, when being prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both overall scores as well as the scores from corresponding groups.
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content. This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z)
OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs [3.5342505775640247]
We present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on text representing each of the selected biases. To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics.
arXiv Detail & Related papers (2023-09-07T17:41:01Z)
Language-Agnostic Bias Detection in Language Models with Bias Probing [22.695872707061078]
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. We propose a bias probing technique called LABDet for evaluating social bias in PLMs with a robust and language-agnostic method. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context.
arXiv Detail & Related papers (2023-05-22T17:58:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.