Model extraction from counterfactual explanations
- URL: http://arxiv.org/abs/2009.01884v1
- Date: Thu, 3 Sep 2020 19:02:55 GMT
- Title: Model extraction from counterfactual explanations
- Authors: Ulrich Aïvodji, Alexandre Bolot, Sébastien Gambs
- Abstract summary: We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-hoc explanation techniques refer to a posteriori methods that can be
used to explain how black-box machine learning models produce their outcomes.
Among post-hoc explanation techniques, counterfactual explanations are becoming
one of the most popular methods to achieve this objective. In particular, in
addition to highlighting the most important features used by the black-box
model, they provide users with actionable explanations in the form of data
instances that would have received a different outcome. Nonetheless, by doing
so, they also leak non-trivial information about the model itself, which raises
privacy issues. In this work, we demonstrate how an adversary can leverage the
information provided by counterfactual explanations to build high-fidelity and
high-accuracy model extraction attacks. More precisely, our attack enables the
adversary to build a faithful copy of a target model by accessing its
counterfactual explanations. The empirical evaluation of the proposed attack on
black-box models trained on real-world datasets demonstrates that it can
achieve high-fidelity and high-accuracy extraction even under low query
budgets.
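As a rough illustration of the attack surface described in the abstract (not the authors' exact procedure), the sketch below assumes hypothetical query_label and query_counterfactual helpers that wrap the target model's prediction API and its counterfactual-explanation endpoint. Each query yields two labeled points: the seed instance and its counterfactual, which by construction receives a different outcome and lies near the decision boundary, so a surrogate can be trained from a small query budget.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_surrogate(query_label, query_counterfactual, seeds):
    """Fit a surrogate from seed queries and their counterfactual explanations.

    query_label(x) and query_counterfactual(x) are hypothetical stand-ins for
    the target model's prediction and explanation interfaces.
    """
    X, y = [], []
    for x in seeds:
        label = query_label(x)             # black-box prediction for the seed
        x_cf = query_counterfactual(x)     # counterfactual explanation for x
        cf_label = query_label(x_cf)       # in the binary case, simply the other class
        X.extend([x, x_cf])
        y.extend([label, cf_label])
    surrogate = RandomForestClassifier(n_estimators=100, random_state=0)
    surrogate.fit(np.asarray(X), np.asarray(y))
    return surrogate

The fidelity of the extracted copy can then be estimated as the fraction of held-out inputs on which the surrogate agrees with the target model.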
Related papers
- Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability [29.459228981179674]
Post hoc explanations incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task.
Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture.
We propose Distractor Erasure Tuning (DiET), a method that adapts black-box models to be robust to distractor erasure.
arXiv Detail & Related papers (2023-07-27T17:06:02Z)
- BELLA: Black box model Explanations by Local Linear Approximations [10.05944106581306]
We present BELLA, a deterministic model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models.
BELLA provides explanations in the form of a linear model trained in the feature space.
BELLA can produce both factual and counterfactual explanations.
arXiv Detail & Related papers (2023-05-18T21:22:23Z)
- Learning with Explanation Constraints [91.23736536228485]
We provide a learning theoretic framework to analyze how explanations can improve the learning of our models.
We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.
arXiv Detail & Related papers (2023-03-25T15:06:47Z)
- Learning to Scaffold: Optimizing Model Explanations for Teaching [74.25464914078826]
We train models on three natural language processing and computer vision tasks.
We find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods.
arXiv Detail & Related papers (2022-04-22T16:43:39Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack [13.28881502612207]
In some scenarios, AI models are trained proprietarily, where neither pre-trained models nor sufficient in-distribution data is publicly available.
We find the effectiveness of existing techniques significantly affected by the absence of pre-trained models.
We formulate model extraction attacks into an adaptive framework that captures these factors with deep reinforcement learning.
arXiv Detail & Related papers (2021-04-13T03:46:59Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception [77.34726150561087]
We propose a methodology to evaluate the effect of distortions in explanations by embedding perceptual distances.
We generate explanations for images in the ImageNet-C dataset and demonstrate how using perceptual distances in the surrogate explainer creates more coherent explanations for the distorted and reference images.
arXiv Detail & Related papers (2021-02-22T12:38:53Z)
- Explainable Deep Modeling of Tabular Data using TableGraphNet [1.376408511310322]
We propose a new architecture that produces explainable predictions in the form of additive feature attributions.
We show that our explainable model attains the same level of performance as black box models.
arXiv Detail & Related papers (2020-02-12T20:02:10Z)