Related papers: Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race

URL: http://arxiv.org/abs/2506.00253v3
Date: Sun, 08 Jun 2025 23:37:10 GMT
Title: Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race
Authors: Lihao Sun, Chengzhi Mao, Valentin Hofmann, Xuechunzi Bai
Abstract summary: We show that aligned language models (LMs) overlook racial concepts in early internal representations when the context is ambiguous. We propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers.
Score: 14.700348476541684
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.

Related papers

Robustly Improving LLM Fairness in Realistic Settings via Interpretability [0.16843915833103415]
Anti-bias prompts fail when realistic contextual details are introduced. We find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints induces significant racial and gender biases. Our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time.
arXiv Detail & Related papers (2025-06-12T17:34:38Z)
Beneath the Surface: How Large Language Models Reflect Hidden Bias [7.026605828163043]
We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias, in which bias concepts are concealed within naturalistic, subtly framed contexts in real-world scenarios. We analyze six state-of-the-art Large Language Models, revealing that while models reduce bias in response to overt bias, they continue to reinforce biases in nuanced settings.
arXiv Detail & Related papers (2025-02-27T04:25:54Z)
Defining bias in AI-systems: Biased models are fair models [2.8360662552057327]
We argue that a precise conceptualization of bias is necessary to effectively address fairness concerns. Rather than viewing bias as inherently negative or unfair, we highlight the importance of distinguishing between bias and discrimination.
arXiv Detail & Related papers (2025-02-25T10:28:16Z)
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection [18.625071242029936]
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. This paper presents a systematic framework to investigate and compare explicit and implicit biases in LLMs.
arXiv Detail & Related papers (2025-01-04T14:08:52Z)
The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained language models (PLMs) have been acknowledged to contain harmful information, such as social biases. We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
Looking at the Overlooked: An Analysis on the Word-Overlap Bias in Natural Language Inference [20.112129592923246]
We focus on an overlooked aspect of the overlap bias in NLI models: the reverse word-overlap bias. Current NLI models are highly biased towards the non-entailment label on instances with low overlap. We investigate the reasons for the emergence of the overlap bias and the role of minority examples in its mitigation.
arXiv Detail & Related papers (2022-11-07T21:02:23Z)
The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings. We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z)
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning [55.96577490779591]
Vision-language models can encode societal biases and stereotypes. There are challenges to measuring and mitigating these multimodal harms. We investigate bias measures and apply ranking metrics for image-text representations.
arXiv Detail & Related papers (2022-03-22T17:59:04Z)
Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race. Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables. This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings [47.721931801603105]
We propose OSCaR, a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.
arXiv Detail & Related papers (2020-06-30T18:18:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.