Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
- URL: http://arxiv.org/abs/2506.05166v2
- Date: Fri, 06 Jun 2025 01:35:43 GMT
- Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
- Authors: Bhavik Chandna, Zubair Bashir, Procheta Sen
- Abstract summary: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases. We analyze how such biases are structurally represented within models such as GPT-2 and Llama2. We show that removing these components not only reduces biased outputs but also affects other NLP tasks.
- Score: 4.17645248123697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also degrades other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because those tasks share important components with the bias-related ones.
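To make the ablation idea concrete, here is a minimal, hedged sketch rather than the authors' pipeline: it zeroes the output of one GPT-2 attention block at a time and measures how a simple she/he continuation gap shifts. The prompt, the continuation pair, and layer-level (rather than edge-level) ablation are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of layer-level ablation for localizing bias-related computation.
# Assumptions: Hugging Face transformers + GPT-2; the prompt/continuations and the
# "gap" metric are illustrative, not the paper's exact edges or bias benchmark.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Log-probability of `continuation` given `prompt`."""
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_lp = logprobs[torch.arange(len(targets)), targets]
    return token_lp[n_prompt - 1:].sum().item()

def zero_attn_output(module, inputs, output):
    # GPT2Attention returns a tuple whose first element is the attention output.
    return (torch.zeros_like(output[0]),) + output[1:]

prompt = "The nurse said that"
def bias_gap() -> float:
    return (continuation_logprob(prompt, " she was tired")
            - continuation_logprob(prompt, " he was tired"))

baseline = bias_gap()
for layer, block in enumerate(model.transformer.h):
    handle = block.attn.register_forward_hook(zero_attn_output)
    print(f"layer {layer:2d}: gap change after ablation = {bias_gap() - baseline:+.3f}")
    handle.remove()
```

If the paper's finding holds, only a few layers should noticeably move the gap when ablated; an edge-level analysis would refine this by scoring individual attention-head and MLP connections instead of whole blocks.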
Related papers
- Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets [17.101242741559428]
This paper focuses on intrinsic bias mitigation and measurement strategies for language models. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotypes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
arXiv Detail & Related papers (2025-01-02T09:40:31Z) - Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness [10.194622474615462]
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors.
Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors.
arXiv Detail & Related papers (2024-11-13T19:08:23Z) - Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue of modern language models (LMs). Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z) - Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models [10.73340009530019]
This study addresses two such biases within Large Language Models (LLMs): representative bias and affinity bias.
We introduce two novel metrics to measure these biases: the Representative Bias Score (RBS) and the Affinity Bias Score (ABS).
Our analysis uncovers marked representative biases in prominent LLMs, with a preference for identities associated with being white, straight, and male.
Our investigation of affinity bias reveals distinctive evaluative patterns within each model, akin to 'bias fingerprints'.
arXiv Detail & Related papers (2024-05-23T13:35:34Z) - Locating and Mitigating Gender Bias in Large Language Models [40.78150878350479]
Large language models (LLMs) are pre-trained on extensive corpora that encode facts and human cognition, including human preferences.
This process can inadvertently lead to these models acquiring biases and prevalent stereotypes in society.
We propose the LSDM (Least Square Debias Method), a knowledge-editing based method for mitigating gender bias in occupational pronouns.
arXiv Detail & Related papers (2024-03-21T13:57:43Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can, at least in part, be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z) - Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model [1.6343144783668118]
Language models (LMs) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias.
We study three methods for identifying causal relations between LM components and particular output.
We apply the methods to GPT-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation.
arXiv Detail & Related papers (2023-10-19T09:39:21Z) - How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study [59.13867562744973]
This work systematically assesses LMs' capabilities for out-of-distribution (OOD) scenarios.
We find that the efficacy of such learning paradigms varies with the type of OOD.
Specifically, while in-context learning (ICL) excels for domain shifts, prompt-based fine-tuning surpasses it for topic shifts.
arXiv Detail & Related papers (2023-09-15T11:15:47Z) - Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems.
Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts.
We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
arXiv Detail & Related papers (2023-03-20T19:32:49Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in function space.
GGD can learn a more robust base model under both settings: task-specific biased models with prior knowledge and a self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
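As an illustration of the reweighting idea, the following is a hedged sketch rather than the paper's exact scheme: each training instance is weighted inversely to the frequency of its (author demographic, label) pair, so the model cannot exploit demographic-label correlations as a shortcut. The grouping variable and toy data are assumptions for illustration.

```python
# Hedged sketch of instance reweighting: weights proportional to
# 1 / P(demographic, label), normalized so the average weight is 1.
from collections import Counter

def balancing_weights(demographics, labels):
    """One weight per instance; rare (demographic, label) pairs get larger weights."""
    joint = Counter(zip(demographics, labels))
    n = len(labels)
    n_groups = len(joint)
    return [n / (n_groups * joint[(d, y)]) for d, y in zip(demographics, labels)]

# Toy usage: the label is correlated with the demographic attribute,
# so instances from under-represented (demographic, label) pairs are up-weighted.
demo = ["A", "A", "A", "B", "B", "B", "B", "B"]
ys   = [ 1,   1,   0,   0,   0,   0,   0,   1 ]
for d, y, w in zip(demo, ys, balancing_weights(demo, ys)):
    print(d, y, round(w, 2))
```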
arXiv Detail & Related papers (2021-09-16T23:40:28Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in local regions (a simplified sketch of the clustering idea follows this entry).
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
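The clustering intuition behind local bias detection can be sketched as follows; this is a simplified, hedged illustration rather than LOGAN itself: cluster inputs in embedding space, then report, per cluster, how much the classifier's accuracy differs between two demographic groups. The embeddings, group labels, and correctness indicators below are random placeholders.

```python
# Hedged sketch of clustering-based local bias detection (not LOGAN's exact method):
# cluster inputs, then flag clusters where accuracy differs sharply between groups.
import numpy as np
from sklearn.cluster import KMeans

def local_bias_report(embeddings, groups, correct, n_clusters=5, seed=0):
    """Per-cluster accuracy gap between two demographic groups (labelled 0/1)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    cluster_ids = km.fit_predict(embeddings)
    for c in range(n_clusters):
        mask = cluster_ids == c
        acc = [correct[mask & (groups == g)].mean() for g in (0, 1)]
        print(f"cluster {c}: acc(group0)={acc[0]:.2f} acc(group1)={acc[1]:.2f} "
              f"gap={abs(acc[0] - acc[1]):.2f}")

# Toy usage with random data standing in for sentence embeddings and predictions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))      # placeholder sentence embeddings
grp = rng.integers(0, 2, size=200)    # placeholder demographic group per instance
ok = rng.integers(0, 2, size=200).astype(float)  # 1.0 if the classifier was correct
local_bias_report(emb, grp, ok)
```

A corpus-level accuracy gap can hide such local disparities, which is the motivation the LOGAN abstract gives for evaluating bias within clusters rather than over the whole dataset.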
This list is automatically generated from the titles and abstracts of the papers on this site.