Related papers: Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

URL: http://arxiv.org/abs/2512.23835v1
Date: Mon, 29 Dec 2025 19:58:11 GMT
Title: Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms
Authors: Himel Ghosh,
Abstract summary: We present a comparative interpretability study of two bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset.<n>We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias.
Score: 0.2538209532048867
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63\% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

Related papers

Mitigating Biases in Language Models via Bias Unlearning [27.565946855618368]
We propose BiasUnlearn, a novel model debiasing framework which achieves targeted debiasing via dual-pathway unlearning mechanisms.<n>The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities.
arXiv Detail & Related papers (2025-09-30T02:15:12Z)
To Bias or Not to Bias: Detecting bias in News with bias-detector [1.8024397171920885]
We perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset.<n>We show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline.<n>Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.
arXiv Detail & Related papers (2025-05-19T11:54:39Z)
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.<n>We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases [0.0]
This study presents a detection framework to identify nuanced biases in Large Language Models (LLMs)<n>The approach integrates contextual analysis, interpretability via attention mechanisms, and counterfactual data augmentation to capture hidden biases.<n>Results show improvements in detecting subtle biases compared to conventional methods.
arXiv Detail & Related papers (2025-03-08T04:43:01Z)
CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans. Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.<n>FAST surpasses state-of-the-art baselines with superior debiasing performance.<n>This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
Debiasing Stance Detection Models with Counterfactual Reasoning and Adversarial Bias Learning [15.68462203989933]
Stance detection models tend to rely on dataset bias in the text part as a shortcut. We propose an adversarial bias learning module to model the bias more accurately.
arXiv Detail & Related papers (2022-12-20T16:20:56Z)
General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race. Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables. This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective. We propose to augment the input sentences in the training data with their corresponding predicate-argument structures. We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z)
LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. We propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.