Fooling SHAP with Stealthily Biased Sampling
- URL: http://arxiv.org/abs/2205.15419v1
- Date: Mon, 30 May 2022 20:33:46 GMT
- Title: Fooling SHAP with Stealthily Biased Sampling
- Authors: Gabriel Laberge, Ulrich Aïvodji and Satoshi Hara
- Abstract summary: SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution.
Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations.
We propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t the background distribution.
- Score: 7.476901945542385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SHAP explanations aim at identifying which features contribute the most to
the difference in model prediction at a specific input versus a background
distribution. Recent studies have shown that they can be manipulated by
malicious adversaries to produce arbitrary desired explanations. However,
existing attacks focus solely on altering the black-box model itself. In this
paper, we propose a complementary family of attacks that leave the model intact
and manipulate SHAP explanations using stealthily biased sampling of the data
points used to approximate expectations w.r.t the background distribution. In
the context of fairness audit, we show that our attack can reduce the
importance of a sensitive feature when explaining the difference in outcomes
between groups, while remaining undetected. These results highlight the
manipulability of SHAP explanations and encourage auditors to treat post-hoc
explanations with skepticism.
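To make the mechanism concrete, the sketch below is a minimal, self-contained illustration (not the authors' algorithm): it computes exact interventional Shapley values for a toy linear model against an honest background sample and against a crudely biased background subsample that shrinks the sensitive feature's attribution. The toy model, data, and selection rule are illustrative assumptions; the paper's attack additionally chooses the biased sample stealthily so that it remains statistically close to the full background data.

```python
# Minimal sketch (assumed toy setting, not the paper's method): exact interventional
# Shapley values w.r.t. a background sample, and a crudely biased background
# subsample that shrinks the attribution of a "sensitive" feature (column 0).
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)

def model(X):
    # Toy linear scoring model in which the sensitive feature (column 0) matters.
    return 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + 0.25 * X[:, 3]

def shapley_values(x, background, f, d):
    """Exact interventional Shapley values of f at x, expectations over the background."""
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(d):
            for S in itertools.combinations(others, r):
                w = math.factorial(len(S)) * math.factorial(d - len(S) - 1) / math.factorial(d)
                # E[f(x_{S∪{j}}, Z_rest)] - E[f(x_S, Z_rest)], Z drawn from the background
                Xb_with = background.copy()
                Xb_with[:, list(S) + [j]] = x[list(S) + [j]]
                Xb_without = background.copy()
                Xb_without[:, list(S)] = x[list(S)]
                phi[j] += w * (f(Xb_with).mean() - f(Xb_without).mean())
    return phi

d, n = 4, 400
background = rng.normal(size=(n, d))
x = np.array([1.5, 0.5, -0.3, 0.8])

honest_phi = shapley_values(x, background, model, d)

# Adversarial step (crude illustration): keep only background points whose sensitive
# feature is close to x[0], so the contrast along that coordinate almost vanishes.
# The paper's attack instead selects the subsample stealthily, keeping it
# statistically close to the full background so the bias is hard to detect.
mask = np.abs(background[:, 0] - x[0]) < 0.5
biased_phi = shapley_values(x, background[mask], model, d)

print("honest SHAP:", np.round(honest_phi, 3))
print("biased SHAP:", np.round(biased_phi, 3))  # attribution of feature 0 shrinks
```

For a linear model the honest attribution of feature j is β_j(x_j − E[Z_j]), so matching the background's sensitive coordinate to the query point drives that term toward zero; the hard part addressed by the paper is achieving this effect while the biased sample still passes statistical comparison with the reference data.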
Related papers
- Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians [60.22542847840578]
Despite advances in adversarial machine learning, inference for Gaussian models in the presence of an adversary is notably understudied.
We consider a self-interested attacker who wishes to disrupt a decision-maker's conditional inference and subsequent actions by corrupting a set of evidentiary variables.
To avoid detection, the attacker also desires the attack to appear plausible wherein plausibility is determined by the density of the corrupted evidence.
arXiv Detail & Related papers (2024-11-21T17:46:55Z)
- PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction & Attribution Sensitivity Analysis [2.5347892611213614]
Deep neural networks for classification are vulnerable to adversarial attacks, where small perturbations to input samples lead to incorrect predictions.
We develop a practical method that exploits this characteristic of model prediction and feature attribution to detect adversarial samples.
Our approach demonstrates competitive performance even when an adversary is aware of the defense mechanism.
arXiv Detail & Related papers (2024-04-12T21:22:21Z)
- Identifiable Latent Neural Causal Models [82.14087963690561]
Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data.
We determine the types of distribution shifts that do contribute to the identifiability of causal representations.
We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations.
arXiv Detail & Related papers (2024-03-23T04:13:55Z)
- Adversarial Counterfactual Visual Explanations [0.7366405857677227]
This paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations.
The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations.
arXiv Detail & Related papers (2023-03-17T13:34:38Z)
- Extracting or Guessing? Improving Faithfulness of Event Temporal Relation Extraction [87.04153383938969]
We improve the faithfulness of TempRel extraction models from two perspectives.
The first perspective is to extract genuinely based on contextual description.
The second perspective is to provide proper uncertainty estimation.
arXiv Detail & Related papers (2022-10-10T19:53:13Z)
- Deconfounding to Explanation Evaluation in Graph Neural Networks [136.73451468551656]
We argue that a distribution shift exists between the full graph and the subgraph, causing the out-of-distribution problem.
We propose Deconfounded Subgraph Evaluation (DSE) which assesses the causal effect of an explanatory subgraph on the model prediction.
arXiv Detail & Related papers (2022-01-21T18:05:00Z)
- Debiased Explainable Pairwise Ranking from Implicit Feedback [0.3867363075280543]
We focus on the state-of-the-art pairwise ranking model, Bayesian Personalized Ranking (BPR).
BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations.
We propose a novel explainable loss function and a corresponding Matrix Factorization-based model that generates recommendations along with item-based explanations.
arXiv Detail & Related papers (2021-07-30T17:19:37Z)
- Adversarial Robustness through the Lens of Causality [105.51753064807014]
The adversarial vulnerability of deep neural networks has attracted significant attention in machine learning.
We propose to incorporate causality into mitigating adversarial vulnerability.
Our method can be seen as the first attempt to leverage causality for mitigating adversarial vulnerability.
arXiv Detail & Related papers (2021-06-11T06:55:02Z)
- Better sampling in explanation methods can prevent dieselgate-like deception [0.0]
Interpretability of prediction models is necessary to determine their biases and causes of errors.
Popular techniques, such as IME, LIME, and SHAP, use perturbation of instance features to explain individual predictions.
We show that the improved sampling increases the robustness of LIME and SHAP, while the previously untested method IME is already the most robust of all.
arXiv Detail & Related papers (2021-01-26T13:41:37Z)
- Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
arXiv Detail & Related papers (2020-09-03T19:02:55Z)
- Towards Transparent and Explainable Attention Models [34.0557018891191]
We first explain why current attention mechanisms in LSTM-based encoders can provide neither a faithful nor a plausible explanation of the model's predictions.
We propose a modified LSTM cell with a diversity-driven training objective that ensures that the hidden representations learned at different time steps are diverse.
Human evaluations indicate that the attention distributions learned by our model offer a plausible explanation of the model's predictions.
arXiv Detail & Related papers (2020-04-29T14:47:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.