What Do Llamas Really Think? Revealing Preference Biases in Language
Model Representations
- URL: http://arxiv.org/abs/2311.18812v1
- Date: Thu, 30 Nov 2023 18:53:13 GMT
- Title: What Do Llamas Really Think? Revealing Preference Biases in Language
Model Representations
- Authors: Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
- Abstract summary: Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in their latent representations.
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
- Score: 62.91799637259657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Do large language models (LLMs) exhibit sociodemographic biases, even when
they decline to respond? To bypass their refusal to "speak," we study this
research question by probing contextualized embeddings and exploring whether
this bias is encoded in their latent representations. We propose a logistic
Bradley-Terry probe which predicts word pair preferences of LLMs from the
words' hidden vectors. We first validate our probe on three pair preference
tasks and thirteen LLMs, where we outperform the word embedding association
test (WEAT), a standard approach in testing for implicit association, by a
relative 27% in error rate. We also find that word pair preferences are best
represented in the middle layers. Next, we transfer probes trained on harmless
tasks (e.g., pick the larger number) to controversial ones (compare
ethnicities) to examine biases in nationality, politics, religion, and gender.
We observe substantial bias for all target classes: for instance, the Mistral
model implicitly prefers Europe to Africa, Christianity to Judaism, and
left-wing to right-wing politics, despite declining to answer. This suggests
that instruction fine-tuning does not necessarily debias contextualized
embeddings. Our codebase is at https://github.com/castorini/biasprobe.
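In Bradley-Terry terms, one natural instantiation of such a probe assigns each word a scalar "strength" computed from its hidden vector and models the probability that one word is preferred over the other as a logistic function of the strength difference. Below is a minimal sketch of that idea in PyTorch, assuming per-word hidden vectors have already been extracted from one of the model's middle layers; the class names and training loop are illustrative, not the authors' implementation (see the linked codebase for that).

```python
import torch
import torch.nn as nn

class BradleyTerryProbe(nn.Module):
    """Logistic Bradley-Terry probe: P(a preferred over b) = sigmoid(w.x_a - w.x_b)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)  # per-word scalar "strength"

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, hidden_dim) hidden vectors of the two words in a pair
        return self.score(x_a) - self.score(x_b)  # logit of P(a preferred over b)

def train_probe(pairs, hidden_dim, epochs=50, lr=1e-3):
    """pairs: list of (x_a, x_b, y) tensors, where y = 1.0 if the first word is preferred."""
    probe = BradleyTerryProbe(hidden_dim)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x_a, x_b, y in pairs:
            opt.zero_grad()
            loss = loss_fn(probe(x_a, x_b).squeeze(-1), y)
            loss.backward()
            opt.step()
    return probe
```

Under this reading, transferring the probe as described in the abstract simply means fitting the strength function on a harmless pair task (e.g., pick the larger number) and applying it unchanged to hidden vectors of controversial word pairs.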
Related papers
- Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings [1.5379084885764847]
Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI).
To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties.
We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research.
arXiv Detail & Related papers (2024-11-25T16:14:45Z)
- One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to a dialect in canonical reasoning tasks.
We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization [0.0]
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns.
This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in English text.
By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language.
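The loss described here follows the standard DPO objective: the policy is rewarded for assigning a higher likelihood than a frozen reference model to the preferred (less biased) completion relative to the rejected (more biased) one. A minimal sketch of that generic objective, not the BiasDPO implementation itself:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss over summed token log-probs of the preferred and rejected
    completions under the trained policy and a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_margin = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()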
arXiv Detail & Related papers (2024-07-18T22:32:20Z)
- White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency.
We introduce the novel Language Agency Bias Evaluation benchmark.
We unveil language agency social biases in content generated by three recent Large Language Models (LLMs).
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
- Disclosure and Mitigation of Gender Bias in LLMs [64.79319733514266]
Large Language Models (LLMs) can generate biased responses.
We propose an indirect probing framework based on conditional generation.
We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs.
arXiv Detail & Related papers (2024-02-17T04:48:55Z)
- Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? [53.98071556805525]
Neural language models (LMs) can be used to evaluate the truth of factual statements.
They can be queried for statement probabilities, or probed for internal representations of truthfulness.
Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs.
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
arXiv Detail & Related papers (2023-11-27T18:59:14Z)
- Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks [15.015148115215315]
We conduct experiments on four popular large language models (LLMs) to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness.
We find that for both tasks, model predictions are closer to the labels from White and female participants.
More specifically, when prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both the overall scores and the scores from the corresponding groups.
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
- "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z)
- OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs [3.5342505775640247]
We present OpinionGPT, a web demo in which users can ask questions and select all biases they wish to investigate.
The demo will answer this question using a model fine-tuned on text representing each of the selected biases.
To train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics.
arXiv Detail & Related papers (2023-09-07T17:41:01Z)
- Language-Agnostic Bias Detection in Language Models with Bias Probing [22.695872707061078]
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases.
We propose a bias probing technique called LABDet for evaluating social bias in PLMs with a robust and language-agnostic method.
We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context.
arXiv Detail & Related papers (2023-05-22T17:58:01Z)