Feature Attributions and Counterfactual Explanations Can Be Manipulated
- URL: http://arxiv.org/abs/2106.12563v1
- Date: Wed, 23 Jun 2021 17:43:31 GMT
- Title: Feature Attributions and Counterfactual Explanations Can Be Manipulated
- Authors: Dylan Slack, Sophie Hilgard, Sameer Singh, Hima Lakkaraju
- Abstract summary: We show how adversaries can design biased models that manipulate model agnostic feature attribution methods.
These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model.
We evaluate the manipulations on real-world datasets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
- Score: 32.579094387004346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such explanations are used to understand and establish trust in models and are vital components in machine learning pipelines. Though explanations are a critical piece in these systems, there is little understanding about how they are vulnerable to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME & SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm & DiCE) into concealing the model's biases. These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model. We evaluate the manipulations on real-world datasets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
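To make the attack surface concrete, the sketch below shows one way (not necessarily the construction used in this paper) that a biased model could hide its behaviour from perturbation-based attribution methods such as LIME and SHAP: these explainers query the model on synthetic perturbations that often fall off the data manifold, so a wrapper can route off-manifold queries to an innocuous surrogate while real inputs still receive the biased predictions. All class and variable names here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class AdversarialWrapper:
    """Biased model disguised from perturbation-based explainers (illustrative sketch)."""

    def __init__(self, biased_model, innocuous_model, real_data):
        self.biased_model = biased_model
        self.innocuous_model = innocuous_model
        # Crude off-manifold detector: real rows vs. noisy perturbations, since
        # LIME/SHAP-style explainers mostly query perturbed (off-manifold) points.
        perturbed = real_data + np.random.normal(0.0, 1.0, real_data.shape)
        X = np.vstack([real_data, perturbed])
        y = np.concatenate([np.ones(len(real_data)), np.zeros(len(perturbed))])
        self.ood_detector = RandomForestClassifier(n_estimators=100).fit(X, y)

    def predict(self, X):
        on_manifold = self.ood_detector.predict(X).astype(bool)
        preds = self.innocuous_model.predict(X)                          # what the explainer sees
        preds[on_manifold] = self.biased_model.predict(X[on_manifold])   # what real users get
        return preds
```

An auditor who runs LIME or SHAP against `AdversarialWrapper.predict` would mostly observe the innocuous model's behaviour, while predictions on genuine inputs remain biased.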
Related papers
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
- Interpretable Data-Based Explanations for Fairness Debugging [7.266116143672294]
Gopher is a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior.
We introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias.
Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias.
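A minimal sketch of the remove-and-retrain intervention idea described above (not Gopher's actual algorithm; the bias metric, model choice, and function names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def parity_gap(model, X, sensitive):
    """Demographic parity gap: difference in positive-prediction rates across two groups."""
    preds = model.predict(X)
    return abs(preds[sensitive == 1].mean() - preds[sensitive == 0].mean())


def removal_effect(X, y, sensitive, subset_mask):
    """How much the bias metric drops when the rows in subset_mask are removed
    from training (retrain-and-compare stand-in for causal responsibility)."""
    full = LogisticRegression(max_iter=1000).fit(X, y)
    keep = ~subset_mask
    reduced = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
    return parity_gap(full, X, sensitive) - parity_gap(reduced, X, sensitive)
```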
arXiv Detail & Related papers (2021-12-17T20:10:00Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Better sampling in explanation methods can prevent dieselgate-like deception [0.0]
Interpretability of prediction models is necessary to determine their biases and causes of errors.
Popular techniques, such as IME, LIME, and SHAP, use perturbation of instance features to explain individual predictions.
We show that improved sampling increases the robustness of LIME and SHAP, while the previously untested IME method is already the most robust of all.
arXiv Detail & Related papers (2021-01-26T13:41:37Z)
- Explainable Artificial Intelligence: How Subsets of the Training Data Affect a Prediction [2.3204178451683264]
We propose a novel methodology which we call Shapley values for training data subset importance.
We show how the proposed explanations can be used to reveal biasedness in models and erroneous training data.
We argue that the explanations enable us to perceive more of the inner workings of the algorithms, and illustrate how models producing similar predictions can be based on very different parts of the training data.
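A rough Monte Carlo sketch of Shapley values over predefined training-data subsets, assuming a coalition's value is the retrained model's predicted probability for a single test point; the paper's exact value function and estimator may differ, and all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def subset_shapley(groups, X, y, x_test, n_perm=200, seed=0):
    """Monte Carlo Shapley values for predefined training-data subsets.

    `groups` maps a subset name to a boolean mask over training rows; a
    coalition's value is the predicted probability for x_test after
    retraining on the union of its subsets.
    """
    rng = np.random.default_rng(seed)
    names = list(groups)
    shap = {name: 0.0 for name in names}

    def value(coalition):
        mask = np.zeros(len(X), dtype=bool)
        for name in coalition:
            mask |= groups[name]
        if mask.sum() == 0 or len(np.unique(y[mask])) < 2:
            return 0.5  # uninformative baseline when training is impossible
        model = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return model.predict_proba(x_test.reshape(1, -1))[0, 1]

    for _ in range(n_perm):
        coalition, prev = [], value([])
        for idx in rng.permutation(len(names)):
            name = names[idx]
            coalition.append(name)
            cur = value(coalition)
            shap[name] += (cur - prev) / n_perm  # marginal contribution of this subset
            prev = cur
    return shap
```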
arXiv Detail & Related papers (2020-12-07T12:15:47Z)
- Deducing neighborhoods of classes from a fitted model [68.8204255655161]
This article presents a new kind of interpretable machine learning method.
It can help to understand how a classification model partitions the feature space into predicted classes, using quantile shifts.
Real data points (or specific points of interest) are used, and the change in the prediction after slightly raising or lowering specific features is observed.
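A minimal sketch of this quantile-shift probing, under the assumption that shifts are taken in the feature's empirical quantile space; `model`, `feature`, and `delta` are illustrative, not the article's exact procedure:

```python
import numpy as np


def quantile_shift_effect(model, X_train, x, feature, delta=0.05):
    """Change in predicted class probabilities when one feature of a real data
    point is shifted up or down by `delta` in the feature's quantile space."""
    col = X_train[:, feature]
    q = (col < x[feature]).mean()                  # empirical quantile of the point
    up = np.quantile(col, min(q + delta, 1.0))     # value one quantile step up
    down = np.quantile(col, max(q - delta, 0.0))   # value one quantile step down

    x_up, x_down = x.copy(), x.copy()
    x_up[feature], x_down[feature] = up, down

    base = model.predict_proba(x.reshape(1, -1))[0]
    return {
        "raise": model.predict_proba(x_up.reshape(1, -1))[0] - base,
        "lower": model.predict_proba(x_down.reshape(1, -1))[0] - base,
    }
```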
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
- Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
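The general recipe behind such an attack could look like the sketch below, where `counterfactual_fn` stands in for whatever counterfactual interface the target exposes; this is an illustration of the idea, not the paper's exact attack:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def extract_surrogate(target_predict, counterfactual_fn, X_queries):
    """Fit a surrogate of a binary target model from query labels plus the
    counterfactual points it returns, which lie near its decision boundary."""
    X_cf = np.array([counterfactual_fn(x) for x in X_queries])
    y_q = target_predict(X_queries)
    y_cf = 1 - y_q  # a counterfactual flips the target's prediction by construction
    X_aug = np.vstack([X_queries, X_cf])
    y_aug = np.concatenate([y_q, y_cf])
    return GradientBoostingClassifier().fit(X_aug, y_aug)
```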
arXiv Detail & Related papers (2020-09-03T19:02:55Z)