Related papers: KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement

URL: http://arxiv.org/abs/2601.21864v1
Date: Thu, 29 Jan 2026 15:32:38 GMT
Title: KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement
Authors: Jinhao Pan, Chahat Raj, Anjishnu Mukherjee, Sina Mansouri, Bowen Wei, Shloka Yada, Ziwei Zhu
Abstract summary: Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. We propose KnowBias, a framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time.
Score: 5.243877326529689
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.
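Illustration: the abstract describes two steps, scoring neurons by attribution on a few probe questions and then amplifying the selected neurons at inference time. The toy sketch below shows that general pattern on a hand-written one-hidden-layer network. Everything in it is an assumption made for illustration: the activation-times-gradient attribution rule, the top-k selection, the multiplicative factor, the weights, and all function names. It is not the KnowBias implementation; see the repository linked above for the authors' code.

```python
# Toy sketch only: hypothetical names and a generic attribution rule,
# not the method or code from the KnowBias paper.

def hidden_activations(x, W1):
    """ReLU hidden layer of a toy one-hidden-layer network."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output_logit(h, w2, scale=None):
    """Linear readout; `scale` optionally multiplies each hidden activation."""
    if scale is None:
        scale = [1.0] * len(h)
    return sum(wj * sj * hj for wj, sj, hj in zip(w2, scale, h))

def attribution_scores(questions, W1, w2):
    """Mean |activation * gradient| per hidden neuron over the probe inputs.
    With a linear readout, d(logit)/d(h_j) is simply w2[j]."""
    totals = [0.0] * len(w2)
    for x in questions:
        h = hidden_activations(x, W1)
        for j, (hj, wj) in enumerate(zip(h, w2)):
            totals[j] += abs(hj * wj)
    return [t / len(questions) for t in totals]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the highest attribution score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def enhancement_scale(n_neurons, selected, factor):
    """Per-neuron multipliers: `factor` for selected neurons, 1.0 otherwise."""
    return [factor if j in selected else 1.0 for j in range(n_neurons)]

# Hand-written toy weights: 2 inputs, 3 hidden neurons, 1 output logit.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w2 = [2.0, -1.0, 0.1]

# Stand-ins for encoded probe questions (assumed, purely numeric here).
probe_questions = [[1.0, 0.0], [1.0, 0.2]]

scores = attribution_scores(probe_questions, W1, w2)
selected = top_k_neurons(scores, k=1)
scale = enhancement_scale(len(w2), selected, factor=1.5)

# Inference on a new input, with and without enhancement. No weights change.
x_new = [1.0, 1.0]
h_new = hidden_activations(x_new, W1)
baseline = output_logit(h_new, w2)
enhanced = output_logit(h_new, w2, scale)
```

In this sketch the intervention is applied only to activations at inference time, which mirrors the abstract's claim that no retraining is required; how the paper actually selects and scales neurons in a real LLM is not specified here.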

Related papers

Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts [29.864293711943038]
We propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in large language models. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance.
arXiv Detail & Related papers (2026-02-04T10:27:36Z)
Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning [91.8584139564909]
Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases. We propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers.
arXiv Detail & Related papers (2026-02-02T01:43:48Z)
Adaptive Generation of Bias-Eliciting Questions for LLMs [18.608477560948003]
Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions worldwide. We introduce a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. We also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias.
arXiv Detail & Related papers (2025-10-14T13:08:10Z)
BLADE: Bias-Linked Adaptive DEbiasing [2.7352017408152083]
BLADE is a generative debiasing framework that requires no prior knowledge of bias or bias-conflicting samples. We evaluate BLADE on multiple benchmark datasets and show that it significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-10-05T12:28:54Z)
What's Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs [8.219247185418821]
Large Language Models (LLMs) often exhibit social biases inherited from their training data. We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response at the term level, they continue to reinforce biases in nuanced settings.
arXiv Detail & Related papers (2025-02-27T04:25:54Z)
Debiasify: Self-Distillation for Unsupervised Bias Mitigation [19.813054813868476]
Simplicity bias poses a significant challenge in neural networks, often leading models to favor simpler solutions and inadvertently learn decision rules influenced by spurious correlations. We introduce Debiasify, a novel self-distillation approach that requires no prior knowledge about the nature of biases. Our method leverages a new distillation loss to transfer knowledge within the network, from deeper layers containing complex, highly-predictive features to shallower layers with simpler, attribute-conditioned features in an unsupervised manner.
arXiv Detail & Related papers (2024-11-01T16:25:05Z)
Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z)
Causality and Independence Enhancement for Biased Node Classification [56.38828085943763]
We propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs). Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations. Our approach CIE not only significantly enhances the performance of GNNs but also outperforms state-of-the-art debiased node classification methods.
arXiv Detail & Related papers (2023-10-14T13:56:24Z)
Unsupervised Learning of Unbiased Visual Representations [12.690228982893]
Deep neural networks often struggle to learn robust representations in the presence of dataset biases. Existing approaches to address this problem typically involve explicit supervision of bias attributes or reliance on prior knowledge about the biases. We present a fully unsupervised debiasing framework with three key steps.
arXiv Detail & Related papers (2022-04-26T10:51:50Z)
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features. This simplicity bias can explain their lack of robustness out of distribution (OOD). We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
