Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models
- URL: http://arxiv.org/abs/2410.22517v1
- Date: Tue, 29 Oct 2024 20:15:56 GMT
- Title: Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models
- Authors: Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran
- Abstract summary: We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts.
We propose $\texttt{ATLAS}$, a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers.
- Score: 15.53216696218776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions, without addressing the root cause: the model itself. Numerous prior works show the influence of the attention module towards steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM's preference for one entity over another. We then propose $\texttt{ATLAS}$ (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across 3 datasets (BBQ, CrowS-Pairs, and WinoGender) using $\texttt{GPT-2 XL}$ (1.5B), $\texttt{GPT-J}$ (6B), $\texttt{LLaMA-2}$ (7B) and $\texttt{LLaMA-3}$ (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show how $\texttt{ATLAS}$ effectively mitigates bias through targeted interventions without compromising downstream performance, with an average increase of only 0.82% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.
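The abstract describes two steps: localizing bias to layers via attention scores, then scaling attention in those layers. The following is a minimal sketch of that general idea on toy data, not the authors' implementation. The function names, the per-layer "attention mass gap" used for ranking, the scaling factor `lam`, and the renormalization step are all assumptions made for illustration; the paper's actual metric and scaling rule may differ.

```python
import numpy as np

def entity_attention_mass(attn, positions):
    """Sum the last-token attention that each layer puts on an entity's tokens.

    attn: array of shape [num_layers, seq_len], assumed to hold head-averaged
    attention weights from the final token (each row sums to 1).
    """
    return attn[:, positions].sum(axis=1)

def localize_layers(attn, pos_a, pos_b, top_k=2):
    """Rank layers by how unevenly they attend to entity A versus entity B.

    Hypothetical criterion: absolute gap in attention mass between entities.
    """
    gap = np.abs(entity_attention_mass(attn, pos_a) - entity_attention_mass(attn, pos_b))
    return np.argsort(gap)[::-1][:top_k]

def scale_attention(row, positions, lam):
    """Scale attention on the given token positions by lam, then renormalize."""
    scaled = row.copy()
    scaled[positions] *= lam
    return scaled / scaled.sum()

# Toy example: 3 layers, 4 tokens; entity A is token 1, entity B is token 3.
toy_attn = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.30, 0.20, 0.30, 0.20],
    [0.10, 0.60, 0.10, 0.20],
])
biased_layers = localize_layers(toy_attn, [1], [3], top_k=1)
for layer in biased_layers:
    toy_attn[layer] = scale_attention(toy_attn[layer], [1], lam=0.5)
```

In this toy setup only the last layer shows a gap between the two entities, so it is the only one scaled; the attention on entity A drops while the row still sums to 1.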
Related papers
- Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models [13.46898179372249]
Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases. We propose a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias. Our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics.
arXiv Detail & Related papers (2025-11-22T17:04:30Z)
- User-Assistant Bias in LLMs [11.825607435336776]
Large language models (LLMs) can bias towards relying on their own or the user's information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. We introduce an 8k multi-turn conversation dataset $\textbf{UserAssist}$ to benchmark, understand and manipulate the user-assistant bias in frontier LLMs.
arXiv Detail & Related papers (2025-08-16T20:33:09Z)
- Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The LiM effect is strongest when inputs occupy up to 50% of a model's context window. We observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z)
- Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z)
- Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities.
We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z)
- Robust Bias Detection in MLMs and its Application to Human Trait Ratings [10.067718208225203]
We propose a systematic statistical approach to quantify bias using mixed models. We explore the novel problem of gender bias in the context of $\textit{personality}$ and $\textit{character}$ traits. We find that ALBERT is unbiased for binary gender but the most biased for non-binary $\textit{neo}$, while RoBERTa-large is the most biased for binary gender but shows small to no bias for $\textit{neo}$.
arXiv Detail & Related papers (2025-02-21T17:18:02Z)
- With a Grain of SALT: Are LLMs Fair Across Social Dimensions? [3.5001789247699535]
This paper presents a systematic analysis of biases in open-source Large Language Models (LLMs) across gender, religion, and race.
We use the SALT dataset, which incorporates five distinct bias triggers: General Debate, Positioned Debate, Career Advice, Problem Solving, and CV Generation.
Our findings reveal consistent polarization across models, with certain demographic groups receiving systematically favorable or unfavorable treatment.
arXiv Detail & Related papers (2024-10-16T12:22:47Z)
- A Multi-LLM Debiasing Framework [85.17156744155915]
Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet they have demonstrated biases that perpetuate societal inequalities.
Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning.
We propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs.
arXiv Detail & Related papers (2024-09-20T20:24:50Z)
- Mind the Gap: A Causal Perspective on Bias Amplification in Prediction & Decision-Making [58.06306331390586]
We introduce the notion of a margin complement, which measures how much a prediction score $S$ changes due to a thresholding operation.
We show that under suitable causal assumptions, the influences of $X$ on the prediction score $S$ are equal to the influences of $X$ on the true outcome $Y$.
arXiv Detail & Related papers (2024-05-24T11:22:19Z)
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias: the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
- Self-Supervised Position Debiasing for Large Language Models [39.261233221850155]
We propose a self-supervised position debiasing (SOD) framework to mitigate position bias for large language models (LLMs).
Experiments on eight datasets and five tasks show that SOD consistently outperforms existing methods in mitigating three types of position biases.
arXiv Detail & Related papers (2024-01-02T14:12:41Z)
- ROBBIE: Robust Bias Evaluation of Large Generative Language Models [27.864027322486375]
Different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes.
We compare 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative LLMs.
We conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements.
arXiv Detail & Related papers (2023-11-29T23:03:04Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mitigating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- General Debiasing for Multimodal Sentiment Analysis [47.05329012210878]
We propose a general debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD) generalization ability of MSA models.
We employ IPW to reduce the effects of large-biased samples, facilitating robust feature learning for sentiment prediction.
The empirical results demonstrate the superior generalization ability of our proposed framework.
arXiv Detail & Related papers (2023-07-20T00:36:41Z)
- A Trip Towards Fairness: Bias and De-Biasing in Large Language Models [1.987426401990999]
Cheap-to-Build Very Large-Language Models (CtB-LLMs) with affordable training are emerging as the next big revolution in natural language processing and understanding.
In this paper, we performed a large investigation of the bias of three families of CtB-LLMs.
We show that debiasing techniques are effective and usable.
arXiv Detail & Related papers (2023-05-23T09:35:37Z)
- Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features and account for the dynamic nature of bias.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- BiasEnsemble: Revisiting the Importance of Amplifying Bias for Debiasing [31.665352191081357]
"Debiasing" aims to train a classifier to be less susceptible to dataset bias.
$f_B$ is trained to focus on bias-aligned samples while $f_D$ is mainly trained with bias-conflicting samples.
We propose a novel biased sample selection method BiasEnsemble which removes the bias-conflicting samples.
arXiv Detail & Related papers (2022-05-29T07:55:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.