Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
- URL: http://arxiv.org/abs/2510.12229v1
- Date: Tue, 14 Oct 2025 07:31:29 GMT
- Title: Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
- Authors: Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli
- Abstract summary: Large language models (LLMs) have been shown to internalize human-like biases during finetuning. The Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs. Patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect.
- Score: 0.7710436567988378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
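The Layer-Patching analysis described in the abstract can be approximated with standard forward hooks: run the pretrained model once to cache its per-layer hidden states, then overwrite the finetuned model's activations at a few chosen layers and re-read its intentionality judgement. The sketch below is a minimal illustration, not the paper's exact protocol; the checkpoint names, the Knobe-style prompt wording, and the patched layer indices are assumptions.

```python
# Minimal layer-patching sketch (illustrative; not the paper's exact models, prompt, or layers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"        # assumed pretrained checkpoint
TUNED = "meta-llama/Llama-2-7b-chat-hf"  # assumed finetuned counterpart
PATCH_LAYERS = {12, 13, 14}              # hypothetical "critical" layers

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto")
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.float16, device_map="auto")

# A Knobe-style harm scenario with a yes/no intentionality question (illustrative wording).
prompt = ("The chairman knew the new program would harm the environment, but he only "
          "cared about profit and started it anyway. Did he harm the environment "
          "intentionally? Answer yes or no:")
enc = tok(prompt, return_tensors="pt")

# 1) Cache the pretrained model's hidden states (index 0 is the embedding output).
with torch.no_grad():
    base_hidden = base(**enc.to(base.device), output_hidden_states=True).hidden_states

# 2) Forward hooks that overwrite the finetuned model's activations at the chosen layers.
def make_patch(layer_idx):
    def hook(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        patched = base_hidden[layer_idx + 1].to(hs.device, hs.dtype)
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return hook

# Attribute path (tuned.model.layers) assumes a Llama-style decoder architecture.
handles = [block.register_forward_hook(make_patch(i))
           for i, block in enumerate(tuned.model.layers) if i in PATCH_LAYERS]

with torch.no_grad():
    logits = tuned(**enc.to(tuned.device)).logits[0, -1]
for h in handles:
    h.remove()

# 3) Compare P("yes") vs P("no") with and without the hooks attached: if the
#    intentionality asymmetry disappears under patching, these layers carry the bias.
yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
no_id = tok(" no", add_special_tokens=False).input_ids[0]
p_yes, p_no = torch.softmax(logits[[yes_id, no_id]].float(), dim=-1).tolist()
print(f"patched run: P(yes)={p_yes:.3f}  P(no)={p_no:.3f}")
```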
Related papers
- When Wording Steers the Evaluation: Framing Bias in LLM judges [23.16746081917015]
Large language models (LLMs) are known to produce varying responses depending on prompt phrasing. We investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. Across 14 judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection.
arXiv Detail & Related papers (2026-01-20T02:48:10Z)
- Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs [0.0]
This paper advances the study of anchoring in large language models (LLMs) through three contributions. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution signaling that the anchors influence reweighting. Findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains.
arXiv Detail & Related papers (2025-11-07T23:35:19Z)
- Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z)
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, their potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose CALM, a new automated bias quantification framework that quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
- Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory [29.201402717025335]
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information.
We have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory.
arXiv Detail & Related papers (2024-08-20T07:40:12Z)
- UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation [12.04811490937078]
We investigate how feedforward neural networks (FFNs) and attention heads result in the bias of large language models (LLMs). To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads.
arXiv Detail & Related papers (2024-05-31T03:59:15Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs [6.090496490133132]
We propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF.
We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning.
arXiv Detail & Related papers (2024-04-15T22:18:50Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs).
We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect).
We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate sycophantic behavior, reducing performance deterioration by 60%, but cannot resolve it entirely.
arXiv Detail & Related papers (2023-11-14T23:40:22Z)
- An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)
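For the catastrophic-forgetting study summarized just above, the underlying measurement can be sketched in a few lines: score the model on a small probe set from an earlier task, run further finetuning, and score it again; the accuracy drop is the forgetting signal. The probe items, the checkpoint, and the placeholder training step below are illustrative assumptions, not that paper's benchmark or recipe.

```python
# Hedged sketch of a catastrophic-forgetting probe (illustrative; not the cited paper's setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_accuracy(model, tok, probes):
    """probes: (prompt, correct_completion, wrong_completion) triples; compares next-token logits."""
    hits = 0
    for prompt, good, bad in probes:
        enc = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**enc).logits[0, -1]
        good_id = tok(" " + good, add_special_tokens=False).input_ids[0]
        bad_id = tok(" " + bad, add_special_tokens=False).input_ids[0]
        hits += int(logits[good_id] > logits[bad_id])
    return hits / len(probes)

name = "gpt2"  # small stand-in; the study evaluates larger instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Toy probes standing in for a held-out slice of an *earlier* task.
old_task_probes = [
    ("The capital of France is", "Paris", "Berlin"),
    ("The opposite of hot is", "cold", "warm"),
]

acc_before = choice_accuracy(model, tok, old_task_probes)
# ... continual instruction tuning on a *new* task would run here ...
acc_after = choice_accuracy(model, tok, old_task_probes)
print(f"accuracy before={acc_before:.2f} after={acc_after:.2f} forgetting={acc_before - acc_after:.2f}")
```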
This list is automatically generated from the titles and abstracts of the papers in this site.