On the Connections between Counterfactual Explanations and Adversarial
Examples
- URL: http://arxiv.org/abs/2106.09992v1
- Date: Fri, 18 Jun 2021 08:22:24 GMT
- Title: On the Connections between Counterfactual Explanations and Adversarial
Examples
- Authors: Martin Pawelczyk, Shalmali Joshi, Chirag Agarwal, Sohini Upadhyay,
Himabindu Lakkaraju
- Abstract summary: We make one of the first attempts at formalizing the connections between counterfactual explanations and adversarial examples.
Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods are equivalent.
We empirically validate our theoretical findings using extensive experimentation with synthetic and real world datasets.
- Score: 14.494463243702908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Counterfactual explanations and adversarial examples have emerged as critical
research areas for addressing the explainability and robustness goals of
machine learning (ML). While counterfactual explanations were developed with
the goal of providing recourse to individuals adversely impacted by algorithmic
decisions, adversarial examples were designed to expose the vulnerabilities of
ML models. While prior research has hinted at the commonalities between these
frameworks, there has been little to no work on systematically exploring the
connections between the literature on counterfactual explanations and
adversarial examples. In this work, we make one of the first attempts at
formalizing the connections between counterfactual explanations and adversarial
examples. More specifically, we theoretically analyze salient counterfactual
explanation and adversarial example generation methods, and highlight the
conditions under which they behave similarly. Our analysis demonstrates that
several popular counterfactual explanation and adversarial example generation
methods such as the ones proposed by Wachter et al. and Carlini and Wagner
(with mean squared error loss), and C-CHVAE and natural adversarial examples by
Zhao et al. are equivalent. We also bound the distance between counterfactual
explanations and adversarial examples generated by the Wachter et al. and DeepFool
methods for linear models. Finally, we empirically validate our theoretical
findings using extensive experimentation with synthetic and real world
datasets.
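To make the claimed overlap concrete, here is a minimal, hypothetical sketch (a toy linear model with parameters of our choosing, not the authors' code). It runs a Wachter-style counterfactual search, i.e. gradient descent on $\lambda (f(x') - y')^2 + \|x' - x\|^2$, whose first term is the same mean-squared-error objective a Carlini-and-Wagner-style attack with MSE loss minimizes, and compares the result against DeepFool's closed-form minimal $\ell_2$ perturbation for a linear classifier:

```python
# Hypothetical sketch (not the authors' code): compare a Wachter-style
# counterfactual with DeepFool's closed-form step on a toy linear model
# f(x) = sigmoid(w.x + b). All constants below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.3              # toy model parameters
x = rng.normal(size=5)                      # instance to explain / attack

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wachter_counterfactual(x, y_target=0.5, lam=10.0, lr=0.05, steps=2000):
    """Gradient descent on lam * (f(x') - y_target)**2 + ||x' - x||**2."""
    x_cf = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_cf + b)
        # derivative of the prediction loss plus the proximity penalty
        grad = lam * 2 * (p - y_target) * p * (1 - p) * w + 2 * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

def deepfool_linear(x):
    """Closed-form minimal L2 perturbation onto the boundary w.x + b = 0."""
    return x - ((w @ x + b) / (w @ w)) * w

x_w = wachter_counterfactual(x)
x_df = deepfool_linear(x)
print("Wachter CF distance :", np.linalg.norm(x_w - x))
print("DeepFool distance   :", np.linalg.norm(x_df - x))
```

With $y'$ set to the decision threshold, both procedures push $x$ toward the hyperplane $w \cdot x + b = 0$, which is the linear-model setting in which the paper's distance bound applies.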
Related papers
- Towards Non-Adversarial Algorithmic Recourse [20.819764720587646]
It has been argued that adversarial examples, unlike counterfactual explanations, have the distinguishing characteristic that they induce a misclassification relative to the ground truth.
We introduce non-adversarial algorithmic recourse and outline why in high-stakes situations, it is imperative to obtain counterfactual explanations that do not exhibit adversarial characteristics.
arXiv Detail & Related papers (2024-03-15T14:18:21Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- On the Connection between Game-Theoretic Feature Attributions and Counterfactual Explanations [14.552505966070358]
Two of the most popular forms of explanation are feature attributions and counterfactual explanations.
This work establishes a clear theoretical connection between game-theoretic feature attributions and counterfactual explanations.
We shed light on the limitations of naively using counterfactual explanations to provide feature importances.
arXiv Detail & Related papers (2023-07-13T17:57:21Z)
- Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back-translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual back to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models; a sketch of this loop follows this entry.
arXiv Detail & Related papers (2023-05-26T16:04:28Z)
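A minimal sketch of the iterative loop described in the entry above; the `model` and `explainer` callables are placeholders of ours, not APIs from the paper:

```python
# Hypothetical sketch of the back-translation-style loop: repeatedly hand
# the explainer its own counterfactual and record whether the prediction
# (and the edit) stabilises. `model` and `explainer` are placeholders.
def iterate_counterfactuals(x, model, explainer, rounds=3):
    trace = [(x, model(x))]
    for _ in range(rounds):
        x = explainer(x, model)        # counterfactual of the current input
        trace.append((x, model(x)))    # the label should flip each round
    return trace
```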
- A Frequency Perspective of Adversarial Robustness [72.48178241090149]
We present a frequency-based understanding of adversarial examples, supported by theoretical and empirical findings.
Our analysis shows that adversarial examples are neither in high-frequency nor in low-frequency components, but are simply dataset dependent.
We propose a frequency-based explanation for the commonly observed accuracy vs. robustness trade-off.
arXiv Detail & Related papers (2021-10-26T19:12:34Z)
- Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks [8.31483061185317]
We study the existence of adversarial examples and adversarial training from the standpoint of convergence.
We provide evidence that pointwise convergence in ANNs can explain these observations.
arXiv Detail & Related papers (2021-07-22T11:56:14Z)
- When and How to Fool Explainable Models (and Humans) with Adversarial Examples [1.439518478021091]
We explore the possibilities and limits of adversarial attacks for explainable machine learning models.
First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios.
Next, we propose a comprehensive framework to study whether adversarial examples can be generated for explainable models.
arXiv Detail & Related papers (2021-07-05T11:20:55Z)
- Adversarial Examples for Unsupervised Machine Learning Models [71.81480647638529]
Adversarial examples causing evasive predictions are widely used to evaluate and improve the robustness of machine learning models.
We propose a framework of generating adversarial examples for unsupervised models and demonstrate novel applications to data augmentation.
arXiv Detail & Related papers (2021-03-02T17:47:58Z)
- Advocating for Multiple Defense Strategies against Adversarial Examples [66.90877224665168]
It has been empirically observed that defense mechanisms designed to protect neural networks against $\ell_\infty$ adversarial examples offer poor performance against $\ell_2$ adversarial examples, and vice versa.
In this paper we conduct a geometrical analysis that validates this observation.
Then, we provide a number of empirical insights to illustrate the effect of this phenomenon in practice.
arXiv Detail & Related papers (2020-12-04T14:42:46Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Learning explanations that are hard to vary [75.30552491694066]
We show that averaging gradients across examples can favor memorization and 'patchwork' solutions that sew together different strategies.
We then propose and experimentally validate a simple alternative algorithm based on a logical AND; a sketch appears after this list.
arXiv Detail & Related papers (2020-09-01T10:17:48Z)
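As flagged in the last entry, a minimal sketch of the 'logical AND' idea; the unanimous sign-agreement rule below is our reading of the approach, not the paper's reference implementation:

```python
# Hypothetical sketch of a sign-agreement (logical-AND) gradient mask:
# a parameter is updated only when every environment's gradient agrees
# on its direction, suppressing explanations that vary across environments.
import numpy as np

def and_mask(env_grads):
    """env_grads: array of shape (n_envs, n_params)."""
    signs = np.sign(env_grads)
    agree = np.abs(signs.sum(axis=0)) == env_grads.shape[0]  # unanimous sign
    return env_grads.mean(axis=0) * agree                    # masked update

grads = np.array([[0.5, -0.2, 0.1],
                  [0.4,  0.3, 0.2]])     # two environments, three parameters
print(and_mask(grads))                   # -> [0.45, 0.0, 0.15]
```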