Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
- URL: http://arxiv.org/abs/2508.09019v1
- Date: Tue, 12 Aug 2025 15:34:18 GMT
- Title: Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs
- Authors: Shivam Dubey
- Abstract summary: Large language models (LLMs) are increasingly integrated into societal systems. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation. We introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias.
- Score: 0.5076419064097734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model's internal workings. Our method involves two primary stages. First, we train linear "probes" on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on \texttt{gpt2-large} demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in the model's later layers. Second, we leverage these findings to compute "steering vectors" by contrasting the model's activation patterns for biased and neutral statements. By adding these vectors during inference, we can actively steer the model's generative process away from producing harmful, stereotypical, or biased content in real-time. We demonstrate the efficacy of this activation steering technique, showing that it successfully alters biased completions toward more neutral alternatives. We present our work as a robust and reproducible system that offers a more direct and interpretable approach to building safer and more accountable LLMs.
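The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the activations are toy arrays standing in for cached gpt2-large hidden states, and the probe, layer choice, and steering strength are illustrative assumptions.

```python
import numpy as np

# Stage 1: train a linear probe on cached hidden-state activations.
# Rows are (hidden_dim,) activation vectors; 1 = biased text, 0 = neutral text.
rng = np.random.default_rng(0)
hidden_dim = 16
biased = rng.normal(loc=1.0, size=(50, hidden_dim))    # stand-in activations
neutral = rng.normal(loc=-1.0, size=(50, hidden_dim))  # stand-in activations
X = np.vstack([biased, neutral])
y = np.array([1] * 50 + [0] * 50)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(hidden_dim)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)  # probe accuracy on the training set

# Stage 2: steering vector = contrast of mean activations for biased vs.
# neutral statements; subtracting it at inference pushes generations away
# from the biased direction.
steering_vec = biased.mean(axis=0) - neutral.mean(axis=0)

def steer(hidden_state, alpha=1.0):
    """Shift a hidden state away from the biased direction by strength alpha."""
    return hidden_state - alpha * steering_vec
```

In a real setup, `steer` would be applied inside a forward hook on a chosen later layer of the model (the paper reports bias representations are most salient there), and `alpha` would be tuned to trade off neutrality against fluency.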
Related papers
- Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts [29.864293711943038]
We propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in large language models. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance.
arXiv Detail & Related papers (2026-02-04T10:27:36Z) - Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs [27.544312683007234]
We introduce a new method for understanding, monitoring and controlling fine-tuned large language models (LLMs). We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%.
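The weight-difference idea in this entry can be sketched as below. The matrices are toy stand-ins, not the paper's setup: fine-tuning is simulated as a rank-1 update to a base weight matrix, and the top singular vector of the difference recovers the planted direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
W_base = rng.normal(size=(d, d))  # stand-in for a base-model weight matrix

# Simulate fine-tuning as a low-rank update encoding one new behavior.
u = rng.normal(size=d)
v = rng.normal(size=d)
W_ft = W_base + np.outer(u, v)

# SVD of the weight difference exposes the update's directions.
U, S, Vt = np.linalg.svd(W_ft - W_base)
top_dir = Vt[0]  # top right-singular vector of the difference

# The recovered direction aligns with the planted one (up to sign).
alignment = abs(top_dir @ (v / np.linalg.norm(v)))
```

Real fine-tuning produces a higher-rank difference, but the same principle applies: the leading singular directions concentrate the newly acquired behavior and can be monitored or projected out.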
arXiv Detail & Related papers (2025-07-31T21:04:12Z) - Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs [51.00909549291524]
Large language models (LLMs) exhibit cognitive biases. These biases vary across models and can be amplified by instruction tuning. It remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise.
arXiv Detail & Related papers (2025-07-09T18:01:14Z) - ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition [52.537021302246664]
Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance). We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51.
arXiv Detail & Related papers (2025-01-31T20:47:06Z) - Fooling LLM graders into giving better grades through neural activity guided adversarial prompting [26.164839501935973]
We propose a systematic method to reveal such biases in AI evaluation systems. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes. We demonstrate that this combination can effectively fool large language model graders into assigning much higher grades than humans would.
arXiv Detail & Related papers (2024-12-17T19:08:22Z) - How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs. Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z) - Looking at Model Debiasing through the Lens of Anomaly Detection [11.113718994341733]
Deep neural networks are sensitive to bias in the data. In this work, we show the importance of accurately predicting the bias-conflicting and bias-aligned samples. We propose a new bias identification method based on anomaly detection.
arXiv Detail & Related papers (2024-07-24T17:30:21Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and
Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - Detecting and Mitigating Algorithmic Bias in Binary Classification using Causal Modeling [0.0]
We show that gender bias in the prediction model is statistically significant at the 0.05 level.
We demonstrate the effectiveness of the causal model in mitigating gender bias by cross-validation.
Our novel approach is intuitive, easy-to-use, and can be implemented using existing statistical software tools such as "lavaan" in R.
arXiv Detail & Related papers (2023-10-19T02:21:04Z) - From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - Fighting Fire with Fire: Contrastive Debiasing without Bias-free Data via Generative Bias-transformation [31.944147533327058]
We propose a novel method, Contrastive Debiasing via Generative Bias-transformation (CDvG), which works without explicit bias labels or bias-free samples.
Our method demonstrates superior performance compared to prior approaches, especially when bias-free samples are scarce or absent.
arXiv Detail & Related papers (2021-12-02T07:16:06Z) - Learning Debiased Models with Dynamic Gradient Alignment and Bias-conflicting Sample Mining [39.00256193731365]
Deep neural networks notoriously suffer from dataset biases which are detrimental to model robustness, generalization and fairness.
We propose a two-stage debiasing scheme to combat against the intractable unknown biases.
arXiv Detail & Related papers (2021-11-25T14:50:10Z) - Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.