Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods
- URL: http://arxiv.org/abs/2501.09006v1
- Date: Wed, 15 Jan 2025 18:45:05 GMT
- Title: Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods
- Authors: Christopher Burger, Charles Walter
- Abstract summary: Local surrogate methods have been used to approximate the workings of complex machine learning models.
Recent work has revealed their vulnerability to adversarial attacks, where the explanation produced is appreciably different while the underlying model's output remains effectively unchanged.
Here we explore using an alternate search method with the goal of finding minimum viable perturbations.
- Abstract: Advances in the effectiveness of machine learning models have come at the cost of enormous complexity, resulting in a poor understanding of how they function. Local surrogate methods have been used to approximate the workings of these complex models, but recent work has revealed their vulnerability to adversarial attacks, where the explanation produced is appreciably different while the meaning and structure of the complex model's output remains similar. This prior work has focused on the existence of these weaknesses but not on their magnitude. Here we explore using an alternate search method with the goal of finding minimum viable perturbations: the fewest perturbations necessary to achieve a fixed similarity value between the explanations of the original and altered text. Intuitively, an explainability method that requires fewer perturbations to reach a given level of instability is less stable than, and thus inferior to, one that requires more. This nuance allows for superior comparisons of the stability of explainability methods.
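To make the search concrete, here is a minimal greedy sketch of hunting for a minimum viable perturbation. The `explain`, `similarity`, and `candidates` callables are hypothetical stand-ins (e.g., a LIME-style explainer, a ranked-list similarity measure, and a synonym generator); the paper's actual search method may differ.

```python
# Hypothetical sketch: greedily apply single-token substitutions until the
# similarity between the original and perturbed explanations falls below a
# fixed threshold, and report how many edits that took. `explain`,
# `similarity`, and `candidates` are assumed stand-ins, not the paper's code.

def minimum_viable_perturbation(tokens, explain, similarity, candidates,
                                threshold=0.5, max_edits=10):
    """Return (num_edits, perturbed_tokens), or (None, tokens) if the
    threshold is never reached within max_edits substitutions."""
    base_expl = explain(tokens)              # e.g., a ranked list of features
    current = list(tokens)
    for num_edits in range(1, max_edits + 1):
        best_sim, best_state = None, None
        for i, token in enumerate(current):
            for repl in candidates(token):   # e.g., near-synonyms of token
                trial = current[:i] + [repl] + current[i + 1:]
                sim = similarity(base_expl, explain(trial))
                if best_sim is None or sim < best_sim:
                    best_sim, best_state = sim, trial
        if best_state is None:               # no candidates to try
            break
        current = best_state
        if best_sim <= threshold:            # explanation now differs enough
            return num_edits, current
    return None, tokens
```

Under this framing, needing fewer edits to cross the threshold reads as evidence that the explanation method under test is less stable.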
Related papers
- The Effect of Similarity Measures on Accurate Stability Estimates for Local Surrogate Models in Text-based Explainable AI [8.23094630594374]
A poor choice of similarity measure can lead to erroneous conclusions on the efficacy of an XAI method.
We investigate a variety of similarity measures designed for text-based ranked lists, including Kendall's Tau, Spearman's Footrule, and Rank-biased Overlap.
- Date: 2024-06-22T12:59:12Z
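For concreteness, here are minimal sketches of the three measures named in the entry above, assuming both inputs rank the same set of items; the RBO variant is the simple truncated prefix form with persistence parameter p, not a full extrapolated implementation.

```python
# Hedged sketches of the three ranked-list measures, assuming a and b are
# permutations of the same items (e.g., features ranked by attribution).
from scipy.stats import kendalltau

def spearman_footrule(a, b):
    """Sum of absolute rank displacements (0 means identical rankings)."""
    pos_b = {item: r for r, item in enumerate(b)}
    return sum(abs(r - pos_b[item]) for r, item in enumerate(a))

def rbo_truncated(a, b, p=0.9):
    """Truncated rank-biased overlap: prefix agreement weighted by p**(d-1).
    Top-weighted; this simple form is not normalized for finite lists."""
    k = min(len(a), len(b))
    total = 0.0
    for d in range(1, k + 1):
        overlap = len(set(a[:d]) & set(b[:d]))
        total += p ** (d - 1) * overlap / d
    return (1 - p) * total

def kendall_tau(a, b):
    """Kendall's Tau between two rankings of the same items."""
    pos_b = {item: r for r, item in enumerate(b)}
    tau, _ = kendalltau(range(len(a)), [pos_b[item] for item in a])
    return tau
```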
- Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human Rationales
Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models.
We present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models.
- Date: 2024-04-03T22:39:33Z
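As an illustrative sketch only (the paper's methodology may differ): one common recipe is to add an auxiliary loss term that pulls the model's token-level scores toward the human rationale mask. The function name and the weighting parameter `lam` here are assumptions.

```python
# Hedged sketch: combine the usual classification loss with a term that
# aligns the model's token-level scores with a human rationale mask.
import torch.nn.functional as F

def rationale_augmented_loss(logits, token_scores, labels, rationale_mask,
                             lam=0.5):
    """logits: (B, C); token_scores: (B, T); rationale_mask: (B, T) in {0,1}."""
    task_loss = F.cross_entropy(logits, labels)
    # Reward high scores exactly on the tokens humans marked as evidence.
    align_loss = F.binary_cross_entropy_with_logits(
        token_scores, rationale_mask.float())
    return task_loss + lam * align_loss
```

Raising `lam` trades task accuracy for explanation plausibility, which is the trade-off the entry's title describes.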
- Uncertainty in Additive Feature Attribution methods
We focus on the class of additive feature attribution explanation methods.
We study the relationship between a feature's attribution and its uncertainty and observe little correlation.
We coin the term "stable instances" for such instances and diagnose factors that make an instance stable.
- Date: 2023-11-29T08:40:46Z
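A hedged sketch of the kind of check that observation implies: re-run a stochastic explainer (e.g., sampling-based LIME or SHAP) many times, then correlate each feature's attribution magnitude with its spread. `explain` is an assumed stand-in returning one attribution vector per call.

```python
# Hedged sketch: estimate attribution and uncertainty by re-running a
# stochastic explainer, then correlate the two across features.
import numpy as np

def attribution_uncertainty_correlation(instance, explain, n_runs=50):
    runs = np.stack([explain(instance) for _ in range(n_runs)])  # (runs, feats)
    magnitude = np.abs(runs.mean(axis=0))    # mean attribution size per feature
    uncertainty = runs.std(axis=0)           # spread across re-runs per feature
    # A coefficient near zero would match the "little correlation" finding.
    return np.corrcoef(magnitude, uncertainty)[0, 1]
```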
- A Discrepancy Aware Framework for Robust Anomaly Detection
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under simple synthesis strategies, it outperforms existing methods by a large margin; it also achieves state-of-the-art localization performance.
- Date: 2023-10-11T15:21:40Z
- On the Robustness of Removal-Based Feature Attributions
We theoretically characterize the properties of robustness of removal-based feature attributions.
Specifically, we provide a unified analysis of such methods and derive upper bounds for the difference between intact and perturbed attributions.
Experiments on synthetic and real-world data validate our theoretical results and demonstrate their practical implications.
- Date: 2023-06-12T23:33:13Z
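The entry does not state the bounds' shape; as a generic illustration (not the paper's theorem), stability results for attributions are often Lipschitz-style bounds of this form:

```latex
% Illustrative shape only, assuming f is L-Lipschitz and \phi is a
% removal-based attribution with a method-dependent constant C:
\[
  \lVert \phi(f, x) - \phi(f, x') \rVert
  \;\le\; C \, L \, \lVert x - x' \rVert,
\]
% i.e., attributions can drift no faster than the input perturbation grows.
```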
- Improving robustness of jet tagging algorithms with adversarial training
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
- Date: 2022-03-25T19:57:19Z
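As a generic illustration of such a strategy (the paper's jet-tagging setup will differ in model, data, and attack details), a single FGSM-style adversarial training step in PyTorch might look like this:

```python
# Hedged sketch: one FGSM-style adversarial training step. The model,
# optimizer, and epsilon are placeholders, not the paper's configuration.
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y, eps=0.01):
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(loss_fn(model(x), y), x)
    x_adv = (x + eps * grad.sign()).detach()   # worst-case-ish perturbation
    optimizer.zero_grad()
    adv_loss = loss_fn(model(x_adv), y)        # fit the perturbed inputs
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```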
- Robustness and Accuracy Could Be Reconcilable by (Proper) Definition
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance, and propose SCORE (self-consistent robust error) as a replacement.
By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
- Date: 2022-02-21T10:36:09Z
- Uncertainty-Aware Few-Shot Image Classification
Few-shot image classification learns to recognize new categories from limited labelled data.
We propose Uncertainty-Aware Few-Shot framework for image classification.
- Date: 2020-10-09T12:26:27Z
- Learning explanations that are hard to vary
We show that averaging across examples can favor memorization and "patchwork" solutions that sew together different strategies.
We then propose and experimentally validate a simple alternative algorithm based on a logical AND.
- Date: 2020-09-01T10:17:48Z
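A minimal sketch of the "logical AND" idea from the entry above, assuming per-environment gradients are available; this is a simplified sign-agreement mask in that spirit, not necessarily the paper's exact algorithm.

```python
# Hedged sketch: keep a gradient component only where its sign agrees across
# every environment, zeroing out components the environments dispute.
import numpy as np

def and_mask_gradient(env_grads):
    """env_grads: list of same-shape per-environment gradient arrays."""
    g = np.stack(env_grads)                    # (n_envs, ...)
    signs = np.sign(g)
    agree = np.all(signs == signs[0], axis=0)  # unanimous sign agreement
    return g.mean(axis=0) * agree              # averaged, disputed parts zeroed
```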
- Reparameterized Variational Divergence Minimization for Stable Imitation
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO (imitation learning from observation) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
- Date: 2020-06-18T19:04:09Z
- Robustness from Simple Classifiers
We investigate the connection between robustness and simplicity.
We find that simpler classifiers, formed by reducing the number of output classes, are less susceptible to adversarial perturbations.
- Date: 2020-02-21T17:13:37Z
This list is automatically generated from the titles and abstracts of the papers on this site.