Plausible Counterfactuals: Auditing Deep Learning Classifiers with
Realistic Adversarial Examples
- URL: http://arxiv.org/abs/2003.11323v1
- Date: Wed, 25 Mar 2020 11:08:56 GMT
- Title: Plausible Counterfactuals: Auditing Deep Learning Classifiers with
Realistic Adversarial Examples
- Authors: Alejandro Barredo-Arrieta and Javier Del Ser
- Abstract summary: The black-box nature of Deep Learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multi-objective heuristics are used to craft a plausible attack on the audited model.
Its utility is showcased on a human face classification task, demonstrating the potential of the proposed framework.
- Score: 84.8370546614042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The last decade has witnessed the proliferation of Deep Learning models in
many applications, achieving unrivaled levels of predictive performance.
Unfortunately, the black-box nature of Deep Learning models has posed
unanswered questions about what they learn from data. Certain application
scenarios have highlighted the importance of assessing the bounds under which
Deep Learning models operate, a problem addressed by using assorted approaches
aimed at audiences from different domains. However, as the focus of the
application shifts toward non-expert users, it becomes mandatory to provide
them with the means to trust the model, just as a human becomes familiar with
a system or process: by understanding the hypothetical circumstances under
which it fails. This is the cornerstone of this research work: to
undertake an adversarial analysis of a Deep Learning model. The proposed
framework constructs counterfactual examples by ensuring their plausibility,
i.e., there is a reasonable probability that a human could generate them without
resorting to a computer program. Therefore, this work should be regarded as a
valuable auditing exercise of the usable bounds a certain model is constrained
within, thereby allowing for a much greater understanding of the capabilities
and pitfalls of a model used in a real application. To this end, a Generative
Adversarial Network (GAN) and multi-objective heuristics are used to craft a
plausible attack on the audited model, efficiently trading off among the
confusion of this model and the intensity and plausibility of the generated
counterfactual. Its utility is showcased on a human face classification
task, demonstrating the potential of the proposed framework.
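The abstract describes the attack as a search over GAN latent codes guided by three competing objectives: the confusion induced in the audited classifier, the intensity of the change, and the plausibility of the resulting sample. The short Python sketch below illustrates how such a trade-off could be scored. It is not the authors' implementation: the generator, discriminator and classifier functions are hypothetical random-weight stand-ins, and a simple weighted scalarization replaces the multi-objective heuristic used in the paper.

```python
# Minimal sketch of the three objectives described in the abstract, scored for
# counterfactual candidates decoded from GAN latent codes. This is NOT the
# authors' code: generator, discriminator and classifier are hypothetical
# random-weight stand-ins, and a weighted sum replaces the paper's
# multi-objective heuristic.
import numpy as np

rng = np.random.default_rng(0)
G_W = rng.standard_normal((16, 64))   # hypothetical generator weights (latent -> "image")
C_W = rng.standard_normal((64, 2))    # hypothetical audited-classifier weights

def generator(z):
    """Decode a latent code into a synthetic sample."""
    return np.tanh(z @ G_W)

def discriminator(x):
    """Realism score in [0, 1]; a GAN critic would play this role."""
    return 1.0 / (1.0 + np.exp(-x.mean()))

def classifier(x):
    """Class probabilities of the audited model."""
    logits = x @ C_W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def objectives(z, x_orig, true_label):
    """Return (confusion, intensity, plausibility) for a latent candidate z."""
    x_cf = generator(z)
    confusion = 1.0 - classifier(x_cf)[true_label]   # push the model off the true label (maximize)
    intensity = np.linalg.norm(x_cf - x_orig)        # size of the change (minimize)
    plausibility = discriminator(x_cf)               # realism of the counterfactual (maximize)
    return confusion, intensity, plausibility

# Toy random search over latent candidates, collapsed into one weighted score.
x_orig, true_label = np.tanh(rng.standard_normal(64)), 0
scored = []
for _ in range(256):
    z = rng.standard_normal(16)
    c, i, p = objectives(z, x_orig, true_label)
    scored.append((c, i, p, z))
best = max(scored, key=lambda t: t[0] - 0.1 * t[1] + t[2])
print("confusion=%.3f  intensity=%.3f  plausibility=%.3f" % best[:3])
```

In the paper's setting, the scalarized selection above would be replaced by a multi-objective search that returns a front of counterfactuals, letting the auditor inspect the trade-off between confusion, intensity and plausibility rather than a single point.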
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z)
- Frugal Reinforcement-based Active Learning [12.18340575383456]
We propose a novel active learning approach for label-efficient training.
The proposed method is iterative and aims at minimizing a constrained objective function that mixes diversity, representativity and uncertainty criteria.
We also introduce a novel weighting mechanism based on reinforcement learning, which adaptively balances these criteria at each training iteration (a toy version of this criterion mixing is sketched after this list).
arXiv Detail & Related papers (2022-12-09T14:17:45Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should take several aspects of the produced adversarial instances into account.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning [27.841725567976315]
We propose a novel framework utilizing Adversarial Inverse Reinforcement Learning.
This framework provides global explanations for decisions made by a Reinforcement Learning model.
We capture intuitive tendencies that the model follows by summarizing the model's decision-making process.
arXiv Detail & Related papers (2022-03-30T17:01:59Z)
- When and How to Fool Explainable Models (and Humans) with Adversarial Examples [1.439518478021091]
We explore the possibilities and limits of adversarial attacks for explainable machine learning models.
First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios.
Next, we propose a comprehensive framework to study whether adversarial examples can be generated for explainable models.
arXiv Detail & Related papers (2021-07-05T11:20:55Z)
- Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack [13.28881502612207]
In some scenarios, AI models are trained proprietarily, where neither pre-trained models nor sufficient in-distribution data is publicly available.
We find the effectiveness of existing techniques significantly affected by the absence of pre-trained models.
We formulate model extraction attacks into an adaptive framework that captures these factors with deep reinforcement learning.
arXiv Detail & Related papers (2021-04-13T03:46:59Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
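As a companion to the "Frugal Reinforcement-based Active Learning" entry above, the toy Python sketch below shows one way an acquisition score could mix diversity, representativity and uncertainty, with the mixing weights adapted over iterations. Everything here is an illustrative assumption rather than the paper's method; in particular, the exponentiated-reward update merely stands in for the reinforcement-learning weighting mechanism the summary mentions.

```python
# Toy illustration (not the paper's implementation) of an active-learning
# acquisition score mixing diversity, representativity and uncertainty, with
# adaptively re-balanced weights. All formulas below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
X_pool = rng.standard_normal((200, 8))         # hypothetical unlabeled pool (features)
probs = rng.dirichlet(np.ones(3), size=200)    # hypothetical model predictions per sample

def uncertainty(p):
    """Predictive entropy of each pool sample."""
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def representativity(X):
    """Closeness to the pool centroid (a crude density proxy)."""
    return 1.0 / (1.0 + np.linalg.norm(X - X.mean(axis=0), axis=1))

def diversity(X, labeled):
    """Distance to the nearest already-labeled sample."""
    if not labeled:
        return np.ones(len(X))
    d = np.linalg.norm(X[:, None] - X[labeled][None], axis=2).min(axis=1)
    return d / (d.max() + 1e-12)

weights = np.ones(3) / 3                       # adaptive mixing weights for the 3 criteria
labeled = []
for step in range(5):
    crit = np.stack([diversity(X_pool, labeled),
                     representativity(X_pool),
                     uncertainty(probs)], axis=1)
    crit = (crit - crit.min(axis=0)) / (crit.max(axis=0) - crit.min(axis=0) + 1e-12)
    scores = crit @ weights
    scores[labeled] = -np.inf                  # never re-select a labeled sample
    pick = int(np.argmax(scores))
    labeled.append(pick)
    # Bandit-style reweighting: criteria that endorsed the pick gain weight.
    # This only stands in for the reinforcement-learning mechanism of the paper.
    weights *= np.exp(0.5 * crit[pick])
    weights /= weights.sum()
    print(f"step {step}: picked sample {pick}, weights={np.round(weights, 3)}")
```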
This list is automatically generated from the titles and abstracts of the papers on this site.