Robust and Stable Black Box Explanations
- URL: http://arxiv.org/abs/2011.06169v1
- Date: Thu, 12 Nov 2020 02:29:03 GMT
- Title: Robust and Stable Black Box Explanations
- Authors: Himabindu Lakkaraju, Nino Arsov, Osbert Bastani
- Abstract summary: We propose a novel framework for generating robust and stable explanations of black box models.
We instantiate this algorithm for explanations in the form of linear models and decision sets.
- Score: 31.05743211871823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As machine learning black boxes are increasingly being deployed in real-world
applications, there has been a growing interest in developing post hoc
explanations that summarize the behaviors of these black boxes. However,
existing algorithms for generating such explanations have been shown to lack
stability and robustness to distribution shifts. We propose a novel framework
for generating robust and stable explanations of black box models based on
adversarial training. Our framework optimizes a minimax objective that aims to
construct the highest fidelity explanation with respect to the worst-case over
a set of adversarial perturbations. We instantiate this algorithm for
explanations in the form of linear models and decision sets by devising the
required optimization procedures. To the best of our knowledge, this work makes
the first attempt at generating post hoc explanations that are robust to a
general class of adversarial perturbations that are of practical interest.
Experimental evaluation with real-world and synthetic datasets demonstrates
that our approach substantially improves robustness of explanations without
sacrificing their fidelity on the original data distribution.
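The minimax formulation described above lends itself to a simple adversarial-training loop. The sketch below is an illustrative reconstruction, not the authors' released code: it fits a linear surrogate explanation to a stand-in black box `f` by repeatedly selecting, from a small hand-specified set of mean-shift perturbations `DELTAS`, the one on which the surrogate's fidelity is currently worst, and then taking a gradient step on that worst-case squared-error loss. The black box, the perturbation set, and all hyperparameters are assumptions made for illustration.

```python
# A minimal sketch (not the paper's implementation) of the minimax idea:
# fit a linear surrogate explanation whose fidelity to a black box is
# optimized against the worst case over a small set of input perturbations.
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # Stand-in black box: a fixed nonlinear scoring function (assumption).
    return np.tanh(X @ np.array([1.5, -2.0, 0.5]) + 0.3 * X[:, 0] * X[:, 1])

X = rng.normal(size=(500, 3))                   # samples from the original distribution
DELTAS = [np.zeros(3),                          # illustrative perturbation set:
          np.array([0.5, 0.0, 0.0]),            # mean shifts along single features
          np.array([0.0, -0.5, 0.0])]

w, b, lr = np.zeros(3), 0.0, 0.05
for _ in range(2000):
    # Inner maximization: find the perturbation with the worst surrogate fidelity.
    losses = []
    for d in DELTAS:
        Xd = X + d
        resid = (Xd @ w + b) - f(Xd)
        losses.append(np.mean(resid ** 2))
    worst = DELTAS[int(np.argmax(losses))]
    # Outer minimization: gradient step on the worst-case squared-error loss.
    Xw = X + worst
    resid = (Xw @ w + b) - f(Xw)
    w -= lr * 2 * Xw.T @ resid / len(Xw)
    b -= lr * 2 * resid.mean()

print("robust linear explanation weights:", np.round(w, 3))
```

Replacing the enumerated perturbation set with a projected-gradient inner maximization, or the linear surrogate with a decision set learner, would move this sketch closer to the two instantiations the abstract describes.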
Related papers
- Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations [80.86128012438834]
We show for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete.
We propose a novel probabilistic approach which is able to provide tight estimates of robustness with strong guarantees.
arXiv Detail & Related papers (2024-07-10T09:13:11Z) - Discriminative Feature Attributions: Bridging Post Hoc Explainability
and Inherent Interpretability [29.459228981179674]
Post hoc explanations incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task.
Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture.
We propose Distractor Erasure Tuning (DiET), a method that adapts black-box models to be robust to distractor erasure.
arXiv Detail & Related papers (2023-07-27T17:06:02Z) - Learning with Explanation Constraints [91.23736536228485]
We provide a learning theoretic framework to analyze how explanations can improve the learning of our models.
We demonstrate the benefits of our approach across a large array of synthetic and real-world experiments.
arXiv Detail & Related papers (2023-03-25T15:06:47Z) - Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles [50.81061839052459]
We formalize the generation of robust counterfactual explanations as a probabilistic problem.
We show the link between the robustness of ensemble models and the robustness of base learners.
Our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
arXiv Detail & Related papers (2022-05-27T17:28:54Z) - Interpretation of Black Box NLP Models: A Survey [0.0]
arXiv Detail & Related papers (2022-03-31T14:54:35Z) - On the Objective Evaluation of Post Hoc Explainers [10.981508361941335]
Modern trends in machine learning research have led to algorithms so intricate that they are considered black boxes.
In an effort to reduce the opacity of decisions, methods have been proposed to construe the inner workings of such models in a human-comprehensible manner.
We propose a framework for the evaluation of post hoc explainers on ground truth that is directly derived from the additive structure of a model.
arXiv Detail & Related papers (2021-06-15T19:06:51Z) - S-LIME: Stabilized-LIME for Model Explanation [7.479279851480736]
Post hoc explanations based on perturbations are widely used approaches to interpret a machine learning model after it has been built.
We propose S-LIME, which utilizes a hypothesis testing framework based on central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation.
arXiv Detail & Related papers (2021-06-15T04:24:59Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Generative Counterfactuals for Neural Networks via Attribute-Informed
Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
arXiv Detail & Related papers (2020-09-03T19:02:55Z) - Reliable Post hoc Explanations: Modeling Uncertainty in Explainability [44.9824285459365]
Black box explanations are increasingly being employed to establish model credibility in high-stakes settings.
However, prior work demonstrates that explanations generated by state-of-the-art techniques are inconsistent, unstable, and provide very little insight into their correctness and reliability.
We develop a novel Bayesian framework for generating local explanations along with their associated uncertainty.
arXiv Detail & Related papers (2020-08-11T22:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.