Feature Attributions and Counterfactual Explanations Can Be Manipulated
- URL: http://arxiv.org/abs/2106.12563v1
- Date: Wed, 23 Jun 2021 17:43:31 GMT
- Title: Feature Attributions and Counterfactual Explanations Can Be Manipulated
- Authors: Dylan Slack, Sophie Hilgard, Sameer Singh, Hima Lakkaraju
- Abstract summary: We show how adversaries can design biased models that manipulate model agnostic feature attribution methods.
These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model.
We evaluate the manipulations on real-world datasets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
- Score: 32.579094387004346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such explanations are used to understand and establish trust in models and are vital components in machine learning pipelines. Though explanations are a critical piece in these systems, there is little understanding about how they are vulnerable to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME & SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm & DiCE) into concealing the model's biases. These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model. We evaluate the manipulations on real-world datasets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
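To make the attack surface concrete, the sketch below shows one way (not necessarily the construction used in this paper) that a biased model could hide its behaviour from perturbation-based attribution methods such as LIME and SHAP: these explainers query the model on synthetic perturbations that often fall off the data manifold, so a wrapper can route off-manifold queries to an innocuous surrogate while real inputs still receive the biased predictions. All class and variable names here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class AdversarialWrapper:
    """Biased model disguised from perturbation-based explainers (illustrative sketch)."""

    def __init__(self, biased_model, innocuous_model, real_data):
        self.biased_model = biased_model
        self.innocuous_model = innocuous_model
        # Crude off-manifold detector: real rows vs. noisy perturbations, since
        # LIME/SHAP-style explainers mostly query perturbed (off-manifold) points.
        perturbed = real_data + np.random.normal(0.0, 1.0, real_data.shape)
        X = np.vstack([real_data, perturbed])
        y = np.concatenate([np.ones(len(real_data)), np.zeros(len(perturbed))])
        self.ood_detector = RandomForestClassifier(n_estimators=100).fit(X, y)

    def predict(self, X):
        on_manifold = self.ood_detector.predict(X).astype(bool)
        preds = self.innocuous_model.predict(X)                          # what the explainer sees
        preds[on_manifold] = self.biased_model.predict(X[on_manifold])   # what real users get
        return preds
```

An auditor who runs LIME or SHAP against `AdversarialWrapper.predict` would mostly observe the innocuous model's behaviour, while predictions on genuine inputs remain biased.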
Related papers
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
- Interpretable Data-Based Explanations for Fairness Debugging [7.266116143672294]
Gopher is a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior.
We introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias.
Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias.
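A minimal sketch of the remove-and-retrain intervention idea described above (not Gopher's actual algorithm; the bias metric, model choice, and function names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def parity_gap(model, X, sensitive):
    """Demographic parity gap: difference in positive-prediction rates across two groups."""
    preds = model.predict(X)
    return abs(preds[sensitive == 1].mean() - preds[sensitive == 0].mean())


def removal_effect(X, y, sensitive, subset_mask):
    """How much the bias metric drops when the rows in subset_mask are removed
    from training (retrain-and-compare stand-in for causal responsibility)."""
    full = LogisticRegression(max_iter=1000).fit(X, y)
    keep = ~subset_mask
    reduced = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
    return parity_gap(full, X, sensitive) - parity_gap(reduced, X, sensitive)
```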
arXiv Detail & Related papers (2021-12-17T20:10:00Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Better sampling in explanation methods can prevent dieselgate-like deception [0.0]
Interpretability of prediction models is necessary to determine their biases and causes of errors.
Popular techniques, such as IME, LIME, and SHAP, use perturbation of instance features to explain individual predictions.
We show that improved sampling increases the robustness of LIME and SHAP, while the previously untested IME method is already the most robust of all.
arXiv Detail & Related papers (2021-01-26T13:41:37Z)
- Explainable Artificial Intelligence: How Subsets of the Training Data Affect a Prediction [2.3204178451683264]
We propose a novel methodology which we call Shapley values for training data subset importance.
We show how the proposed explanations can be used to reveal biasedness in models and erroneous training data.
We argue that the explanations enable us to perceive more of the inner workings of the algorithms, and illustrate how models producing similar predictions can be based on very different parts of the training data.
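A rough Monte Carlo sketch of Shapley values over predefined training-data subsets, assuming a coalition's value is the retrained model's predicted probability for a single test point; the paper's exact value function and estimator may differ, and all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def subset_shapley(groups, X, y, x_test, n_perm=200, seed=0):
    """Monte Carlo Shapley values for predefined training-data subsets.

    `groups` maps a subset name to a boolean mask over training rows; a
    coalition's value is the predicted probability for x_test after
    retraining on the union of its subsets.
    """
    rng = np.random.default_rng(seed)
    names = list(groups)
    shap = {name: 0.0 for name in names}

    def value(coalition):
        mask = np.zeros(len(X), dtype=bool)
        for name in coalition:
            mask |= groups[name]
        if mask.sum() == 0 or len(np.unique(y[mask])) < 2:
            return 0.5  # uninformative baseline when training is impossible
        model = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return model.predict_proba(x_test.reshape(1, -1))[0, 1]

    for _ in range(n_perm):
        coalition, prev = [], value([])
        for idx in rng.permutation(len(names)):
            name = names[idx]
            coalition.append(name)
            cur = value(coalition)
            shap[name] += (cur - prev) / n_perm  # marginal contribution of this subset
            prev = cur
    return shap
```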
arXiv Detail & Related papers (2020-12-07T12:15:47Z)
- Deducing neighborhoods of classes from a fitted model [68.8204255655161]
This article presents a new kind of interpretable machine learning method.
It can help to understand how a classification model partitions the feature space into predicted classes, using quantile shifts.
Real data points (or specific points of interest) are used, and the change in the prediction after slightly raising or lowering specific features is observed.
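A minimal sketch of this quantile-shift probing, under the assumption that shifts are taken in the feature's empirical quantile space; `model`, `feature`, and `delta` are illustrative, not the article's exact procedure:

```python
import numpy as np


def quantile_shift_effect(model, X_train, x, feature, delta=0.05):
    """Change in predicted class probabilities when one feature of a real data
    point is shifted up or down by `delta` in the feature's quantile space."""
    col = X_train[:, feature]
    q = (col < x[feature]).mean()                  # empirical quantile of the point
    up = np.quantile(col, min(q + delta, 1.0))     # value one quantile step up
    down = np.quantile(col, max(q - delta, 0.0))   # value one quantile step down

    x_up, x_down = x.copy(), x.copy()
    x_up[feature], x_down[feature] = up, down

    base = model.predict_proba(x.reshape(1, -1))[0]
    return {
        "raise": model.predict_proba(x_up.reshape(1, -1))[0] - base,
        "lower": model.predict_proba(x_down.reshape(1, -1))[0] - base,
    }
```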
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
- Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
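The general recipe behind such an attack could look like the sketch below, where `counterfactual_fn` stands in for whatever counterfactual interface the target exposes; this is an illustration of the idea, not the paper's exact attack:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def extract_surrogate(target_predict, counterfactual_fn, X_queries):
    """Fit a surrogate of a binary target model from query labels plus the
    counterfactual points it returns, which lie near its decision boundary."""
    X_cf = np.array([counterfactual_fn(x) for x in X_queries])
    y_q = target_predict(X_queries)
    y_cf = 1 - y_q  # a counterfactual flips the target's prediction by construction
    X_aug = np.vstack([X_queries, X_cf])
    y_aug = np.concatenate([y_q, y_cf])
    return GradientBoostingClassifier().fit(X_aug, y_aug)
```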
arXiv Detail & Related papers (2020-09-03T19:02:55Z)