Related papers: UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation

UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation

URL: http://arxiv.org/abs/2405.20612v1
Date: Fri, 31 May 2024 03:59:15 GMT
Title: UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation
Authors: Hanzhang Zhou, Zijian Feng, Zixiao Zhu, Junlang Qian, Kezhi Mao,
Abstract summary: We investigate how feedforward neural networks (FFNs) and attention heads result in the bias of large language models (LLMs) To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads.
Score: 12.04811490937078
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in various tasks using the in-context learning (ICL) paradigm. However, their effectiveness is often compromised by inherent bias, leading to prompt brittleness, i.e., sensitivity to design settings such as example selection, order, and prompt formatting. Previous studies have addressed LLM bias through external adjustment of model outputs, but the internal mechanisms that lead to such bias remain unexplored. Our work delves into these mechanisms, particularly investigating how feedforward neural networks (FFNs) and attention heads result in the bias of LLMs. By Interpreting the contribution of individual FFN vectors and attention heads, we identify the biased LLM components that skew LLMs' prediction toward specific labels. To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads. Extensive experiments across 12 NLP datasets demonstrate that UniBias significantly enhances ICL performance and alleviates prompt brittleness of LLMs.

Related papers

Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
Anchoring Bias in Large Language Models: An Experimental Study [5.229564709919574]
Large Language Models (LLMs) like GPT-4 and Gemini have significantly advanced artificial intelligence. This study delves into anchoring bias, a cognitive bias where initial information disproportionately influences judgment.
arXiv Detail & Related papers (2024-12-09T15:45:03Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
A Multi-LLM Debiasing Framework [85.17156744155915]
Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning. We propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs.
arXiv Detail & Related papers (2024-09-20T20:24:50Z)
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [89.9648814145473]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue. We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z)
Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data [9.90951705988724]
Large Language Models (LLM) are prone to inheriting and amplifying societal biases. LLM bias can have far-reaching consequences, leading to unfair practices and exacerbating social inequalities.
arXiv Detail & Related papers (2024-08-20T23:54:26Z)
Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception [13.592532358127293]
We investigate the presence and nature of bias within Large Language Models (LLMs) We probe whether LLMs exhibit biases, particularly in political bias prediction and text continuation tasks. We propose debiasing strategies, including prompt engineering and model fine-tuning.
arXiv Detail & Related papers (2024-03-22T00:59:48Z)
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework [20.753141804841]
Large language models (LLMs) can easily generate biased and discriminative responses. This paper focuses on social bias, tackling the association between demographic information and LLM outputs.
arXiv Detail & Related papers (2024-03-13T17:46:28Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate Large Language Models (LLMs) biases. Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
arXiv Detail & Related papers (2024-03-05T07:47:34Z)
The Gaps between Pre-train and Downstream Settings in Bias Evaluation and Debiasing [74.7319697510621]
In-Context Learning (ICL) induces smaller changes to PLMs compared to FT-based debiasing methods. ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods.
arXiv Detail & Related papers (2024-01-16T17:15:08Z)
Self-Supervised Position Debiasing for Large Language Models [39.261233221850155]
We propose a self-supervised position debiasing (SOD) framework to mitigate position bias for large language models (LLMs) Experiments on eight datasets and five tasks show that SOD consistently outperforms existing methods in mitigating three types of position biases.
arXiv Detail & Related papers (2024-01-02T14:12:41Z)
On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers [52.88268942796418]
Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer. We show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view.
arXiv Detail & Related papers (2023-09-25T13:35:28Z)
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.