Race, Ethnicity and Their Implication on Bias in Large Language Models
- URL: http://arxiv.org/abs/2601.12868v1
- Date: Mon, 19 Jan 2026 09:24:24 GMT
- Title: Race, Ethnicity and Their Implication on Bias in Large Language Models
- Authors: Shiyue Hu, Ruizhe Li, Yanjun Gao
- Abstract summary: We study how race and ethnicity are represented and operationalized within large language models (LLMs). We find that demographic information is distributed across internal units with substantial cross-model variation. Interventions suppressing such neurons reduce bias but leave substantial residual effects.
- Score: 9.202525724606188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.
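The abstract names a three-stage interpretability pipeline (probing, neuron-level attribution, targeted intervention) but no code is released here. As a rough illustration of the probing stage only, the sketch below trains a linear probe on hidden states to test whether a demographic attribute is linearly decodable; gpt2 and the toy texts and labels are stand-ins for the paper's three models and two datasets.

```python
# Minimal probing sketch: is a demographic attribute linearly decodable
# from a model's hidden states? All inputs below are toy placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

texts = ["The patient is a 45-year-old Black man.",
         "The patient is a 45-year-old white man."] * 20   # toy corpus
labels = [0, 1] * 20                                       # toy demographic labels

feats = []
with torch.no_grad():
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        # mean-pool the final-layer hidden states into one vector per text
        feats.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # well above chance => attribute is encoded
```

Above-chance probe accuracy is the usual evidence that the attribute is represented; the attribution and intervention stages then ask where it lives and whether it is actually used.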
Related papers
- Investigating Associational Biases in Inter-Model Communication of Large Generative Models [8.394205333688165]
Social bias in generative AI can manifest as performance disparities and as associational bias. We study how associations evolve within an inter-model communication pipeline that alternates between image generation and image description. Our results reveal demographic drifts toward younger representations for both actions and emotions, as well as toward more female-presenting representations.
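The summary describes the pipeline only at a high level. The loop below is a schematic sketch of that alternation; `generate_image`, `describe_image`, and `estimate_demographics` are hypothetical stubs standing in for a text-to-image model, an image captioner, and a demographic classifier, so only the loop structure reflects the paper.

```python
import random

def generate_image(prompt: str) -> str:
    # stub: a real pipeline would call a text-to-image model here
    return f"<image of: {prompt}>"

def describe_image(image: str) -> str:
    # stub: a real pipeline would call an image-captioning model here
    return image.strip("<>").removeprefix("image of: ")

def estimate_demographics(caption: str) -> dict:
    # stub: a real pipeline would classify perceived age and gender presentation
    return {"age": random.choice(["child", "young adult", "middle-aged", "older"]),
            "gender_presentation": random.choice(["female", "male", "ambiguous"])}

def communication_chain(seed_prompt: str, rounds: int = 5) -> list[dict]:
    """Alternate generation and description, logging perceived demographics per round."""
    log, prompt = [], seed_prompt
    for step in range(rounds):
        caption = describe_image(generate_image(prompt))
        log.append({"step": step, "caption": caption,
                    "demographics": estimate_demographics(caption)})
        prompt = caption  # the caption seeds the next generation round
    return log

for row in communication_chain("a person laughing at a party"):
    print(row)
```

Tracking the logged demographics across rounds is what exposes the drift the paper reports.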
arXiv Detail & Related papers (2026-01-29T18:29:55Z) - Demographic Probing of Large Language Models Lacks Construct Validity [16.29607362682272]
We study how large language models adapt their behavior to demographic attributes. This approach typically uses a single demographic cue in isolation as a signal for group membership. We find that cues intended to represent the same demographic group induce only partially overlapping changes in model behavior.
arXiv Detail & Related papers (2026-01-26T13:41:35Z) - Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations [2.86989372262348]
Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), reveals improvements in demographic representation ranging from 2% to 20%.
arXiv Detail & Related papers (2025-09-03T00:25:25Z) - How Quantization Shapes Bias in Large Language Models [61.40435736418359]
We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types. We employ both probabilistic and generated-text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability.
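Neither the quantization recipes nor the benchmarks are detailed in this summary. Below is a minimal sketch of one such comparison, assuming facebook/opt-125m as the model, PyTorch dynamic int8 quantization of its Linear layers, and a single CrowS-Pairs-style sentence pair as the probabilistic metric; the paper's nine benchmarks are far broader.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"  # small causal LM whose Linear layers dynamic quantization covers
tok = AutoTokenizer.from_pretrained(name)
fp32 = AutoModelForCausalLM.from_pretrained(name).eval()
int8 = torch.ao.quantization.quantize_dynamic(fp32, {torch.nn.Linear}, dtype=torch.qint8)

def sent_logprob(model, text: str) -> float:
    """Sum of token log-probabilities: the core of a probabilistic bias metric."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(2, ids[:, 1:, None]).sum().item()

# A CrowS-Pairs-style contrast: a stereotyping sentence vs. its demographic
# counterfactual. These toy sentences are illustrative, not from a benchmark.
pair = ("The Black patient was noncompliant.", "The white patient was noncompliant.")
for tag, m in [("fp32", fp32), ("int8", int8)]:
    gap = sent_logprob(m, pair[0]) - sent_logprob(m, pair[1])
    print(f"{tag}: log-prob gap = {gap:.3f}")
```

A shift in the log-probability gap between the counterfactual pair after quantization is the kind of effect the paper measures at scale.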
arXiv Detail & Related papers (2025-08-25T14:48:26Z) - Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts [19.20592062296075]
Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring. This work examines the allocational fairness of LLM-based hiring systems through two tasks that reflect actual HR usage.
arXiv Detail & Related papers (2025-01-08T07:28:10Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
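The summary does not include the attribution method itself, but the suppression step is straightforward to sketch: zero out the flagged activations with a forward hook and compare generations. The layer and neuron indices below are arbitrary placeholders, not neurons the paper actually identifies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Placeholder indices: the paper locates its Social Bias Neurons with an
# attribution method; these three are arbitrary stand-ins for illustration.
LAYER, NEURONS = 6, [17, 512, 1901]

def suppress(_module, _inputs, output):
    output[:, :, NEURONS] = 0.0  # zero the flagged units' activations
    return output

ids = tok("The nurse said that", return_tensors="pt").input_ids
with torch.no_grad():
    base = model.generate(ids, max_new_tokens=8, do_sample=False,
                          pad_token_id=tok.eos_token_id)

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(suppress)
with torch.no_grad():
    ablated = model.generate(ids, max_new_tokens=8, do_sample=False,
                             pad_token_id=tok.eos_token_id)
handle.remove()

print("original:", tok.decode(base[0]))
print("ablated :", tok.decode(ablated[0]))
```

Comparing the two continuations, and a fairness metric over many such prompts, is how the suppression's effect on bias and on language modeling ability is assessed.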
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims, which is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z) - Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
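A minimal sketch of the technique, assuming gpt2 as a stand-in for the instruction-following models such prompting is usually applied to; the persona wording and the subjective task are invented for illustration.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

question = ("Is the following sentence offensive? Answer yes or no.\n"
            "Sentence: 'That presentation was insanely good.'\nAnswer:")
personas = ["",  # zero-shot baseline: no sociodemographic cue
            "Imagine you are a 70-year-old retired teacher from a small town. "]

for persona in personas:
    out = generator(persona + question, max_new_tokens=5, do_sample=False)
    # gpt2 is not instruction-tuned, so the continuation is illustrative only;
    # the point is that the persona prefix changes the conditioned output.
    print(repr(out[0]["generated_text"][len(persona + question):]))
```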
arXiv Detail & Related papers (2023-09-13T15:42:06Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes [72.13373216644021]
We study the societal impact of machine learning by considering the collection of models that are deployed in a given context.
We find deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available.
These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
arXiv Detail & Related papers (2023-07-12T01:11:52Z) - Write It Like You See It: Detectable Differences in Clinical Notes By Race Lead To Differential Model Recommendations [15.535251319178379]
We investigate the level of implicit race information available to machine learning models and human experts.
We find that models can identify patient self-reported race from clinical notes even when the notes are stripped of explicit indicators of race.
We show that models trained on these race-redacted clinical notes can still perpetuate existing biases in clinical treatment decisions.
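A toy reconstruction of the experimental logic, not the paper's actual setup: redact explicit race terms with a regex, then check whether a bag-of-words classifier still separates the groups. The two-template corpus is fabricated and makes the residual proxies (insurance status, visit type) trivially separable; real clinical notes leak far more subtly.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fabricated toy notes; the paper uses real clinical narratives.
notes = (["Pt is a Black male, reports housing instability, ED follow-up."] * 20 +
         ["Pt is a white male, routine annual visit, private insurance."] * 20)
labels = [1] * 20 + [0] * 20

# Strip explicit race indicators, mimicking the paper's redaction step.
RACE_TERMS = re.compile(r"\b(black|white|african.american|caucasian|hispanic|asian)\b", re.I)
redacted = [RACE_TERMS.sub("[REDACTED]", n) for n in notes]

X = TfidfVectorizer().fit_transform(redacted)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5, scoring="roc_auc")
print(f"AUC on redacted notes: {auc.mean():.2f}")  # > 0.5 => residual race signal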
arXiv Detail & Related papers (2022-05-08T18:24:11Z) - Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias [45.956112337250275]
We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior.
We apply this methodology to analyze gender bias in pre-trained Transformer language models.
Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
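A compact sketch of the mediation logic in the style of activation patching: the total effect of swapping the gendered cue is compared with the effect that remains when one candidate mediator (here, a single MLP block's output) is held at its original value. The model, layer, and prompts are illustrative choices, not the paper's experimental grid.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = model.transformer.h[6].mlp  # one candidate mediator, chosen arbitrarily

def gender_gap(prompt: str, patch=None):
    """logit(' she') - logit(' he') after `prompt`; optionally patch LAYER's output."""
    cache = {}
    def hook(_mod, _inp, out):
        cache["act"] = out.detach()
        return patch if patch is not None else out  # returning a tensor overrides the output
    handle = LAYER.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(tok(prompt, return_tensors="pt").input_ids).logits[0, -1]
    handle.remove()
    she, he = tok(" she").input_ids[0], tok(" he").input_ids[0]
    return (logits[she] - logits[he]).item(), cache["act"]

# Both prompts tokenize to the same length, so the cached activation can be patched in.
base_gap, base_act = gender_gap("The nurse said that")
swap_gap, _ = gender_gap("The doctor said that")
direct_gap, _ = gender_gap("The doctor said that", patch=base_act)

total = swap_gap - base_gap
direct = direct_gap - base_gap  # effect with the mediator held at its original value
print(f"total={total:.3f}  direct={direct:.3f}  indirect={total - direct:.3f}")
```

Repeating this for every component yields the sparsity and synergy picture the abstract describes: a few mediators carry most of the indirect effect.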
arXiv Detail & Related papers (2020-04-26T01:53:03Z)