When and How to Fool Explainable Models (and Humans) with Adversarial
Examples
- URL: http://arxiv.org/abs/2107.01943v2
- Date: Fri, 7 Jul 2023 11:59:57 GMT
- Title: When and How to Fool Explainable Models (and Humans) with Adversarial
Examples
- Authors: Jon Vadillo, Roberto Santana and Jose A. Lozano
- Abstract summary: We explore the possibilities and limits of adversarial attacks for explainable machine learning models.
First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios.
Next, we propose a comprehensive framework to study whether adversarial examples can be generated for explainable models.
- Score: 1.439518478021091
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reliable deployment of machine learning models such as neural networks
continues to be challenging due to several limitations. Some of the main
shortcomings are the lack of interpretability and the lack of robustness
against adversarial examples or out-of-distribution inputs. In this exploratory
review, we explore the possibilities and limits of adversarial attacks for
explainable machine learning models. First, we extend the notion of adversarial
examples to fit in explainable machine learning scenarios, in which the inputs,
the output classifications and the explanations of the model's decisions are
assessed by humans. Next, we propose a comprehensive framework to study whether
(and how) adversarial examples can be generated for explainable models under
human assessment, introducing and illustrating novel attack paradigms. In
particular, our framework considers a wide range of relevant yet often ignored
factors such as the type of problem, the user expertise or the objective of the
explanations, in order to identify the attack strategies that should be adopted
in each scenario to successfully deceive the model (and the human). The
intention of these contributions is to serve as a basis for a more rigorous and
realistic study of adversarial examples in the field of explainable machine
learning.
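To make the attack surface concrete, here is a minimal, self-contained sketch of the two artifacts a human assessor would inspect in this setting: the model's prediction and a gradient-based saliency explanation, before and after an adversarial perturbation. The FGSM attack, the saliency method, and the toy model below are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: an FGSM adversarial example plus a gradient-based
# saliency explanation, i.e. the two artifacts (prediction and explanation)
# that a human assessor would see. Model, input and epsilon are toy
# placeholders, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency_map(model, x, class_idx):
    """|d score_class / d input|: one common, simple explanation."""
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, class_idx].backward()
    return x.grad.abs().squeeze(0)

def fgsm(model, x, y, eps=0.05):
    """One-step Fast Gradient Sign Method perturbation."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
    x = torch.rand(1, 1, 28, 28)                                 # toy "image"
    y = torch.tensor([3])                                        # arbitrary label
    x_adv = fgsm(model, x, y)
    pred_clean = model(x).argmax(1).item()
    pred_adv = model(x_adv).argmax(1).item()
    sal_clean = saliency_map(model, x, pred_clean)
    sal_adv = saliency_map(model, x_adv, pred_adv)
    # Under the paper's framing, flipping the prediction is not enough: the
    # perturbed input must also come with an explanation that still looks
    # plausible to the human assessor.
    print(pred_clean, pred_adv, (sal_clean - sal_adv).abs().mean().item())
```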
Related papers
- A Survey on Transferability of Adversarial Examples across Deep Neural Networks [53.04734042366312]
Adversarial examples can manipulate machine learning models into making erroneous predictions.
The transferability of adversarial examples enables black-box attacks which circumvent the need for detailed knowledge of the target model.
This survey explores the landscape of the transferability of adversarial examples across deep neural networks.
arXiv Detail & Related papers (2023-10-26T17:45:26Z)
- On the Connections between Counterfactual Explanations and Adversarial Examples [14.494463243702908]
We make one of the first attempts at formalizing the connections between counterfactual explanations and adversarial examples.
Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods are equivalent.
We empirically validate our theoretical findings using extensive experimentation with synthetic and real world datasets.
arXiv Detail & Related papers (2021-06-18T08:22:24Z)
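As an informal illustration of why such an equivalence is plausible (a sketch, not the cited paper's formalization): many counterfactual-explanation and adversarial-example methods solve essentially the same constrained problem for a classifier f, an input x, and a target class y', differing mainly in the choice of distance d and additional constraints:

```latex
\min_{\delta}\; d(x,\, x+\delta)
\quad\text{subject to}\quad
f(x+\delta) = y', \qquad x+\delta \in \mathcal{X},
```

where \mathcal{X} denotes the set of valid inputs. A counterfactual method reports x+\delta to the user as an explanation, while an adversarial attack deploys the same x+\delta against the model.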
- Individual Explanations in Machine Learning Models: A Case Study on Poverty Estimation [63.18666008322476]
Machine learning methods are being increasingly applied in sensitive societal contexts.
The present case study has two main objectives. First, to expose these challenges and how they affect the use of relevant and novel explanation methods. Second, to present a set of strategies that mitigate such challenges when implementing explanation methods in a relevant application domain.
arXiv Detail & Related papers (2021-04-09T01:54:58Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these activation profiles can quickly pinpoint the exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
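A rough sketch of the underlying idea (my reading of the summary above, not the cited paper's actual framework): record per-layer activation statistics for a clean input and a perturbed one, then flag the layers whose profile changes most. The model, inputs, and the perturbation below are toy placeholders.

```python
# Toy sketch: compare per-layer mean-absolute-activation "profiles" for a
# clean input vs. a perturbed stand-in for an adversarial input, and rank
# layers by relative change. Everything here is a hypothetical placeholder.
import torch
import torch.nn as nn

def activation_profile(model, x):
    """Mean |activation| per submodule, collected with forward hooks."""
    profile, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            profile[name] = output.detach().abs().mean().item()
        return hook

    for name, module in model.named_modules():
        if name:  # skip the top-level container itself
            handles.append(module.register_forward_hook(make_hook(name)))
    model(x)
    for h in handles:
        h.remove()
    return profile

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    x_clean = torch.rand(1, 32)
    x_adv = x_clean + 0.1 * torch.randn_like(x_clean)  # stand-in, not a real attack
    p_clean = activation_profile(model, x_clean)
    p_adv = activation_profile(model, x_adv)
    # Layers with the largest relative change are candidate "exploited" areas.
    for name in p_clean:
        rel = abs(p_adv[name] - p_clean[name]) / (p_clean[name] + 1e-8)
        print(f"layer {name}: clean={p_clean[name]:.4f} adv={p_adv[name]:.4f} rel_change={rel:.3f}")
```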
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
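The diversity-enforcing loss mentioned above can be pictured with a small sketch (a generic formulation I am assuming, not the cited method): given several candidate latent perturbations, penalize pairs that point in similar directions so the resulting counterfactuals differ from one another.

```python
# Hypothetical sketch of a diversity-enforcing loss over K latent
# perturbations: a pairwise cosine-similarity penalty that pushes the
# perturbation directions apart. The decoder/classifier terms that the
# actual method would also optimize are omitted here.
import torch
import torch.nn.functional as F

def diversity_loss(deltas: torch.Tensor) -> torch.Tensor:
    """deltas: (K, latent_dim) perturbations; mean pairwise |cosine similarity|."""
    d = F.normalize(deltas, dim=1)
    sim = d @ d.t()                      # (K, K) cosine similarities
    off_diag = sim - torch.eye(len(d))   # ignore self-similarity
    return off_diag.abs().sum() / (len(d) * (len(d) - 1))

if __name__ == "__main__":
    torch.manual_seed(0)
    deltas = torch.randn(4, 16, requires_grad=True)  # K=4 candidate perturbations
    loss = diversity_loss(deltas)  # would be added to the counterfactual objective
    loss.backward()
    print(loss.item(), deltas.grad.shape)
```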
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
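A compact sketch of that ensemble-search idea (heavily simplified; the fitness function below is a toy stand-in for "craft adversarial texts against the selected subset and measure how well they transfer to the remaining models"): a small genetic algorithm over binary masks that select which models join the ensemble.

```python
# Simplified, hypothetical sketch of a genetic algorithm over model subsets.
# A candidate is a binary mask over N available models; its real fitness would
# be the transfer rate of adversarial examples crafted against the selected
# subset. Here the fitness is a deterministic toy stand-in so the sketch runs.
import random

N_MODELS, POP_SIZE, GENERATIONS, MUTATION_P = 8, 20, 30, 0.1
random.seed(0)

def fitness(mask):
    # Placeholder: pretend certain models are more "useful" in the ensemble.
    useful = [0.9, 0.2, 0.7, 0.4, 0.8, 0.3, 0.6, 0.5]
    score = sum(u for u, m in zip(useful, mask) if m)
    return score - 0.3 * sum(mask)  # small penalty for large ensembles

def crossover(a, b):
    cut = random.randrange(1, N_MODELS)
    return a[:cut] + b[cut:]

def mutate(mask):
    return [1 - m if random.random() < MUTATION_P else m for m in mask]

population = [[random.randint(0, 1) for _ in range(N_MODELS)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]  # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("selected models:", [i for i, m in enumerate(best) if m], "fitness:", fitness(best))
```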
- Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
The black-box nature of Deep Learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multi-objective optimization are used to furnish a plausible attack on the audited model.
Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
arXiv Detail & Related papers (2020-03-25T11:08:56Z)
- A Hierarchy of Limitations in Machine Learning [0.0]
This paper attempts a comprehensive, structured overview of the specific conceptual, procedural, and statistical limitations of models in machine learning when applied to society.
Modelers themselves can use the described hierarchy to identify possible failure points and think through how to address them.
Consumers of machine learning models can know what to question when deciding whether, where, and how to apply machine learning.
arXiv Detail & Related papers (2020-02-12T19:39:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.