Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment
- URL: http://arxiv.org/abs/2408.00137v2
- Date: Tue, 29 Apr 2025 01:52:03 GMT
- Title: Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment
- Authors: Sangwon Yu, Jongyoon Song, Bongkyu Hwang, Hoyoung Kang, Sooah Cho, Junhwa Choi, Seongho Joe, Taehee Lee, Youngjune L. Gwon, Sungroh Yoon,
- Abstract summary: We observe that language models exhibit a negative bias in the binary decisions of complex reasoning tasks. We propose a negative attention score (NAS) to systematically and quantitatively formulate negative bias. We show that NASA significantly reduces the gap between precision and recall caused by negative bias while preserving their generalization abilities.
- Score: 26.598938398594402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A binary decision task, like yes-no questions or answer verification, reflects significant real-world scenarios, such as users seeking confirmation of the correctness of their decisions on specific issues. In this work, we observe that language models exhibit a negative bias in the binary decisions of complex reasoning tasks. Based on our observations and the rationale about attention-based model dynamics, we propose a negative attention score (NAS) to systematically and quantitatively formulate negative bias. Based on NAS, we identify attention heads that attend to negative tokens provided in the instructions as answer candidates for the binary decision, regardless of the question in the prompt, and validate their association with the negative bias. Additionally, we propose the negative attention score alignment (NASA) method, a parameter-efficient fine-tuning technique that addresses the extracted negatively biased attention heads. Experimental results from various domains of reasoning tasks and a large model search space demonstrate that NASA significantly reduces the gap between precision and recall caused by negative bias while preserving the models' generalization abilities.
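The abstract does not give the exact NAS formula, so the following is only a minimal sketch of the idea, assuming that a head's negative attention score can be proxied by the attention weight its decision position places on the negative answer-candidate token (e.g. "No") supplied in the instruction; the function name, tensor layout, and outlier threshold below are illustrative assumptions rather than the paper's implementation.

```python
import torch


def negative_attention_scores(attn: torch.Tensor,
                              neg_token_idx: int,
                              decision_idx: int = -1) -> torch.Tensor:
    """Illustrative per-head proxy for a negative attention score (NAS).

    attn: attention weights of one layer, shape [num_heads, seq_len, seq_len],
          where rows are query positions and columns are key positions.
    neg_token_idx: position of the negative answer candidate (e.g. "No")
                   provided in the instruction.
    decision_idx: query position at which the model emits its yes/no decision.

    Returns a [num_heads] tensor giving, per head, the attention mass placed
    on the negative candidate token at the decision step.
    """
    return attn[:, decision_idx, neg_token_idx]


# Toy example with random (softmax-normalized) attention weights.
num_heads, seq_len = 8, 16
attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)

nas = negative_attention_scores(attn, neg_token_idx=3)

# Hypothetical criterion: flag heads whose score is more than one standard
# deviation above the mean as candidate negatively biased heads.
biased_heads = (nas > nas.mean() + nas.std()).nonzero(as_tuple=True)[0]
print("NAS per head:", nas.tolist())
print("candidate negatively biased heads:", biased_heads.tolist())
```

In practice the attention tensor would come from a real model's attention outputs rather than random weights, and the flagged heads would be the ones targeted by the parameter-efficient fine-tuning that the paper calls NASA.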
Related papers
- A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge [48.00855840536793]
Negative bias refers to the tendency of large language models to excessively generate negative responses in binary decision tasks. We show that large language models exhibit format-level negative bias, meaning that the prompt format influences their responses more than the semantics of the negative response does. We also identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question.
arXiv Detail & Related papers (2025-11-14T01:18:18Z) - Internal Bias in Reasoning Models leads to Overthinking [58.817405319722596]
We show for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. By masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53%.
arXiv Detail & Related papers (2025-05-22T09:35:52Z) - Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a particular view by probing a biased model in edge cases of excessive bias scenarios.
Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances.
The direct, incautious responses of the unaligned models suggest a need to further refine how decisively they respond.
arXiv Detail & Related papers (2024-08-15T15:23:00Z) - Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue in modern language models (LMs).
Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings.
By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation [19.06428714669272]
We systematically test how methods for intrinsic debiasing affect neural machine translation models.
We highlight three challenges and mismatches between the debiasing techniques and their end-goal usage.
arXiv Detail & Related papers (2024-06-02T15:57:29Z) - Investigating Bias Representations in Llama 2 Chat via Activation Steering [0.0]
We use activation steering to probe for and mitigate biases related to gender, race, and religion.
Our findings reveal inherent gender bias in Llama 2 7B Chat, persisting even after Reinforcement Learning from Human Feedback.
This work also provides valuable insights into effective red-teaming strategies for Large Language Models.
arXiv Detail & Related papers (2024-02-01T07:48:50Z) - Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings.
CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method.
Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z) - Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z) - HANS, are you clever? Clever Hans Effect Analysis of Neural Systems [1.6267479602370545]
Instruction-tuned Large Language Models (It-LLMs) have exhibited outstanding abilities to reason about the cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively.
Several multiple-choice questions (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities.
However, earlier works have demonstrated the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation.
arXiv Detail & Related papers (2023-09-21T20:52:18Z) - Making Large Language Models Better Reasoners with Alignment [57.82176656663245]
Reasoning is a cognitive process of using evidence to reach a sound conclusion.
Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities.
We introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps.
arXiv Detail & Related papers (2023-09-05T11:32:48Z) - Controlling Bias Exposure for Fair Interpretable Predictions [11.364105288235308]
We argue that a favorable debiasing method should use sensitive information 'fairly' rather than blindly eliminating it.
Our model achieves a desirable trade-off between debiasing and task performance along with producing debiased rationales as evidence.
arXiv Detail & Related papers (2022-10-14T01:49:01Z) - InterFair: Debiasing with Natural Language Feedback for Fair Interpretable Predictions [30.246111670391375]
We argue that a favorable debiasing method should use sensitive information 'fairly,' with explanations, rather than blindly eliminating it.
We explore two interactive setups with a frozen predictive model and show that users who are able to provide feedback can achieve a better and fairer balance between task performance and bias mitigation.
arXiv Detail & Related papers (2022-10-14T00:54:12Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Learning Debiased Models with Dynamic Gradient Alignment and Bias-conflicting Sample Mining [39.00256193731365]
Deep neural networks notoriously suffer from dataset biases which are detrimental to model robustness, generalization and fairness.
We propose a two-stage debiasing scheme to combat the intractable unknown biases.
arXiv Detail & Related papers (2021-11-25T14:50:10Z) - UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z) - Fairness Through Robustness: Investigating Robustness Disparity in Deep Learning [61.93730166203915]
We argue that traditional notions of fairness are not sufficient when the model is vulnerable to adversarial attacks.
We show that measuring robustness bias is a challenging task for DNNs and propose two methods to measure this form of bias.
arXiv Detail & Related papers (2020-06-17T22:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.