Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP
- URL: http://arxiv.org/abs/2505.11189v2
- Date: Tue, 23 Sep 2025 15:19:17 GMT
- Title: Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP
- Authors: Francesco Sovrano
- Abstract summary: Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification), which are often shaped by one's default beliefs. Building on evidence that LLMs encode such defaults, we ask: can the general belief-driven heuristics behind misinformative behaviour be recovered from LLMs as clear rules?
- Score: 1.5567685129899713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) which are often shaped by one's default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive," "math is complex") and can act as "bags of heuristics," we ask: can general belief-driven heuristics behind misinformative behaviour be recovered from LLMs as clear rules? A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical inputs/outputs, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically reliable abstractions, thereby enabling off-the-shelf global XAI to detect belief-related heuristics in LLMs. To obtain ground truth, we hard-code bias-inducing nonlinear heuristics of increasing complexity (univariate, conjunctive, nonconvex) into popular LLMs (ChatGPT and Llama) via system instructions. This way, we find that RuleFit under-detects non-univariate biases, while global SHAP better approximates conjunctive ones but does not yield actionable rules. To bridge this gap, we propose RuleSHAP, a rule-extraction algorithm that couples global SHAP-value aggregations with rule induction to better capture non-univariate bias, improving heuristics detection over RuleFit by +94% (MRR@1) on average. Our results provide a practical pathway for revealing belief-driven biases in LLMs.
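To make the described coupling concrete, here is a minimal sketch in Python: fit a surrogate model on numeric belief scores, aggregate global SHAP values, then induce rules over the SHAP-selected features. The synthetic data, the injected conjunctive bias, and all names are illustrative assumptions; the actual RuleSHAP algorithm is more involved. A toy MRR@1 helper is included since the abstract reports that metric.

```python
# Sketch of a SHAP-then-rules pipeline (illustrative only; not RuleSHAP itself).
# Assumes beliefs are already mapped to numeric scores, as the abstract describes.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 5))                        # belief scores in [0, 1]
y = np.where((X[:, 0] > 0.7) & (X[:, 1] > 0.5), 1.0, 0.0)    # injected conjunctive bias

# 1) Fit a surrogate of the LLM's belief -> output mapping.
surrogate = GradientBoostingRegressor().fit(X, y)

# 2) Aggregate global SHAP values to rank the beliefs driving the behaviour.
shap_values = shap.TreeExplainer(surrogate).shap_values(X)
global_importance = np.abs(shap_values).mean(axis=0)
top = np.argsort(global_importance)[::-1][:2]                # most influential beliefs

# 3) Induce human-readable rules over the SHAP-selected features.
tree = DecisionTreeRegressor(max_depth=2).fit(X[:, top], y)
print(export_text(tree, feature_names=[f"belief_{i}" for i in top]))

def mrr_at_1(ranked_rule_lists, gold_rules):
    """MRR@1 reduces to a top-1 hit rate: 1 if the first extracted rule matches."""
    hits = [ranked[0] == gold for ranked, gold in zip(ranked_rule_lists, gold_rules)]
    return sum(hits) / len(hits)
```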
Related papers
- The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models [67.58848748317506]
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs. In this paper, we reveal a counter-intuitive reality: arbitrary-order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs.
arXiv Detail & Related papers (2026-01-21T16:41:58Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models [13.343944091570386]
Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy.
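A hedged sketch of the core idea as the summary states it: LLM-proposed rules become binary features, and logistic regression learns their weights. The rule set, data, and labels below are illustrative assumptions, and RLIE's iterative refinement and evaluation loop is not shown.

```python
# Weighting LLM-proposed rules with logistic regression (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each LLM-proposed rule becomes a binary feature: does it fire on the example?
rules = [
    lambda ex: ex["length"] > 100,     # hypothetical rule 1
    lambda ex: ex["has_citation"],     # hypothetical rule 2
]
examples = [{"length": 120, "has_citation": True},
            {"length": 40,  "has_citation": False},
            {"length": 150, "has_citation": False},
            {"length": 30,  "has_citation": True}]
labels = np.array([1, 0, 1, 0])

R = np.array([[float(rule(ex)) for rule in rules] for ex in examples])
clf = LogisticRegression().fit(R, labels)
print("rule weights:", clf.coef_[0])   # signed weight = each rule's learned vote
```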
arXiv Detail & Related papers (2025-10-22T15:50:04Z) - Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. Variation across model and prompt configurations, however, can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We find that intentional LLM hacking is strikingly simple: by replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
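As a toy illustration (entirely synthetic, not the paper's data) of why paraphrase-level degrees of freedom are dangerous: rerunning a test under several paraphrase-induced perturbations and reporting only the best p-value inflates false positives even when no true effect exists.

```python
# Synthetic demo: cherry-picking across paraphrases manufactures "significance".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n, paraphrases = 200, 10
p_values = []
for _ in range(paraphrases):
    # Two groups with NO true difference; paraphrase noise perturbs annotations.
    a = rng.normal(0, 1, n) + rng.normal(0, 0.3, n)
    b = rng.normal(0, 1, n) + rng.normal(0, 0.3, n)
    p_values.append(ttest_ind(a, b).pvalue)

print(f"best p-value over {paraphrases} paraphrases: {min(p_values):.3f}")
# With enough paraphrases, min(p) can dip below 0.05 by chance alone.
```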
arXiv Detail & Related papers (2025-09-10T17:58:53Z) - DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs [1.89915151018241]
We argue that implicit bias in Large Language Models (LLMs) is not only an ethical, but also a technical issue. We developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness).
arXiv Detail & Related papers (2025-05-15T06:53:37Z) - Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models [40.853803921563596]
Current large language models (LLMs) may still capture dataset biases and utilize them during inference. We propose an information gain-guided causal intervention debiasing framework (IGCIDB). IGCIDB can effectively debias LLMs to improve their generalizability across different tasks.
arXiv Detail & Related papers (2025-04-17T12:39:25Z) - Enough Coin Flips Can Make LLMs Act Bayesian [71.79085204454039]
Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs use ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching.
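For reference, the Bayesian benchmark such experiments compare against is just conjugate updating: a model "acting Bayesian" on coin flips should track the Beta-Bernoulli posterior. A sketch, assuming a uniform prior (this illustrates the framework, not the paper's code):

```python
# Beta-Bernoulli posterior mean: the Bayesian answer to "P(heads) after n flips".
def posterior_mean_heads(k_heads: int, n_flips: int,
                         alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of a Beta(alpha, beta) prior after observing the flips."""
    return (alpha + k_heads) / (alpha + beta + n_flips)

print(posterior_mean_heads(7, 10))  # (1+7)/(2+10) = 8/12 ≈ 0.667 under a uniform prior
```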
arXiv Detail & Related papers (2025-03-06T18:59:23Z) - Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations [43.491353243991284]
We introduce Robust Rule Induction, a task that evaluates large language models' capability to infer rules from data fused with noisy examples. Experiments across arithmetic, cryptography, and list functions reveal that: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) despite slight accuracy variation, LLMs exhibit instability under noise; and (3) counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction.
arXiv Detail & Related papers (2025-02-22T10:03:19Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better in-distribution performance than the strongest baselines with a fraction of their trainable parameters.
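A schematic of the weighted-sum idea as I read it from the abstract; the real PAWN derives its weights from LLM hidden states and positions with trained heads, so everything below (feature choice, weights, toy head) is an assumption for illustration.

```python
# Attention-style weights over tokens, then a weighted sum of per-token
# next-token-distribution features (here: log-prob and entropy).
import numpy as np

def pawn_like_score(token_logprobs: np.ndarray, token_entropies: np.ndarray,
                    raw_weights: np.ndarray) -> float:
    weights = np.exp(raw_weights) / np.exp(raw_weights).sum()  # softmax over tokens
    features = np.stack([token_logprobs, token_entropies], axis=1)  # (T, 2)
    return float(weights @ features @ np.array([1.0, -0.5]))   # toy scoring head

rng = np.random.default_rng(0)
T = 12  # sequence length
print(pawn_like_score(rng.normal(-3, 1, T), rng.uniform(0, 4, T), rng.normal(0, 1, T)))
```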
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Are LLMs Good Zero-Shot Fallacy Classifiers? [24.3005882003251]
We focus on leveraging Large Language Models (LLMs) for zero-shot fallacy classification.
With comprehensive experiments on benchmark datasets, we find that LLMs show potential as zero-shot fallacy classifiers.
Our novel multi-round prompting schemes yield further improvements, especially for small LLMs.
arXiv Detail & Related papers (2024-10-19T09:38:55Z) - WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents [55.64361927346957]
We propose a neurosymbolic approach that learns rules gradient-free through large language models (LLMs).
Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC).
On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods.
arXiv Detail & Related papers (2024-10-09T23:37:36Z) - Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models [25.337295202341608]
Large Language Models (LLMs) are supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent.
Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios.
This paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities.
arXiv Detail & Related papers (2024-07-11T12:26:55Z) - Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z) - A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection. BTProp introduces a belief tree of logically related statements by decomposing a parent statement into child statements. Our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
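A heavily simplified sketch of the belief-tree idea: decompose a parent statement into children, then recompute the parent's belief from theirs. The product-and-blend combination below is an assumption for illustration, not BTProp's actual probabilistic model.

```python
# Toy belief tree: low child support drags down an overconfident parent claim.
from dataclasses import dataclass, field

@dataclass
class Statement:
    text: str
    prior: float                      # model's own confidence in this statement
    children: list = field(default_factory=list)

def propagate(node: Statement) -> float:
    if not node.children:
        return node.prior
    child_belief = 1.0
    for child in node.children:       # conjunction-style: all parts must hold
        child_belief *= propagate(child)
    return 0.5 * node.prior + 0.5 * child_belief  # blend parent and children

root = Statement("Composite claim", 0.9, [Statement("Part A", 0.95),
                                          Statement("Part B", 0.3)])
print(propagate(root))  # ≈ 0.59, well below the parent's 0.9 prior
```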
arXiv Detail & Related papers (2024-06-11T05:21:37Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - Local Universal Explainer (LUX) -- a rule-based explainer with factual, counterfactual and visual explanations [7.673339435080445]
Local Universal Explainer (LUX) is a rule-based explainer that can generate factual, counterfactual and visual explanations.
It is based on a modified decision tree algorithm that allows oblique splits and integrates with feature-importance XAI methods such as SHAP.
We tested our method on real and synthetic datasets and compared it with state-of-the-art rule-based explainers such as LORE, EXPLAN and Anchor.
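The distinguishing ingredient here is the oblique split: thresholding a linear combination of features rather than a single feature, with the combination optionally informed by SHAP importances. A minimal sketch (the weights and data are made up):

```python
# An oblique decision split: partition rows by the hyperplane w·x <= t,
# instead of the axis-aligned x_j <= t of a standard decision tree.
import numpy as np

def oblique_split(X: np.ndarray, w: np.ndarray, t: float):
    mask = X @ w <= t
    return X[mask], X[~mask]

shap_importance = np.array([0.7, 0.1, 0.2])   # pretend global SHAP weights
X = np.random.default_rng(0).normal(size=(10, 3))
left, right = oblique_split(X, shap_importance, t=0.0)
print(len(left), len(right))
```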
arXiv Detail & Related papers (2023-10-23T13:04:15Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs).
We then propose Sequence X (Check) GPT, a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - Machine Learning with Probabilistic Law Discovery: A Concise Introduction [77.34726150561087]
Probabilistic Law Discovery (PLD) is a logic based Machine Learning method, which implements a variant of probabilistic rule learning.
PLD is close to Decision Tree/Random Forest methods, but it differs significantly in how relevant rules are defined.
This paper outlines the main principles of PLD, highlights its benefits and limitations, and provides some application guidelines.
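One minimal reading of probabilistic rule learning is scoring candidate IF-THEN rules by their support and conditional probability (precision) on data; PLD's actual rule-relevance criterion differs, so treat this only as the basic ingredients.

```python
# Score a candidate rule "IF premise THEN conclusion" on tabular data.
def rule_stats(rows, premise, conclusion):
    fired = [r for r in rows if premise(r)]
    support = len(fired) / len(rows)                 # how often the premise holds
    precision = (sum(conclusion(r) for r in fired) / len(fired)
                 if fired else 0.0)                  # P(conclusion | premise)
    return support, precision

rows = [{"fever": True, "flu": True}, {"fever": True, "flu": False},
        {"fever": False, "flu": False}, {"fever": True, "flu": True}]
print(rule_stats(rows, lambda r: r["fever"], lambda r: r["flu"]))  # (0.75, ≈0.67)
```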
arXiv Detail & Related papers (2022-12-22T17:40:13Z) - Utilizing XAI technique to improve autoencoder based model for computer network anomaly detection with Shapley Additive Explanation (SHAP) [0.0]
Machine learning (ML) and Deep Learning (DL) methods are being adopted rapidly, especially in computer network security.
The lack of transparency of ML- and DL-based models is a major obstacle to their implementation, and they are criticized for their black-box nature.
XAI is a promising area that can improve the trustworthiness of these models by giving explanations and interpreting its output.
arXiv Detail & Related papers (2021-12-14T09:42:04Z) - Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z) - Stratified Rule-Aware Network for Abstract Visual Reasoning [46.015682319351676]
The Raven's Progressive Matrices (RPM) test is typically used to examine abstract reasoning capability.
Recent studies, taking advantage of Convolutional Neural Networks (CNNs), have made encouraging progress on the RPM test.
We propose a Stratified Rule-Aware Network (SRAN) to generate the rule embeddings for two input sequences.
arXiv Detail & Related papers (2020-02-17T08:44:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.