The Effect of Similarity Measures on Accurate Stability Estimates for Local Surrogate Models in Text-based Explainable AI
- URL: http://arxiv.org/abs/2406.15839v1
- Date: Sat, 22 Jun 2024 12:59:12 GMT
- Title: The Effect of Similarity Measures on Accurate Stability Estimates for Local Surrogate Models in Text-based Explainable AI
- Authors: Christopher Burger, Charles Walter, Thai Le
- Abstract summary: A poor choice of similarity measure can result in erroneous conclusions on the efficacy of an XAI method.
We investigate a variety of similarity measures designed for text-based ranked lists, including Kendall's Tau, Spearman's Footrule, and Rank-biased Overlap.
- Score: 8.23094630594374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has investigated the vulnerability of local surrogate methods to adversarial perturbations on a machine learning (ML) model's inputs, where the explanation is manipulated while the meaning and structure of the original input remain similar under the complex model. While weaknesses have been shown to exist across many methods, the reasons behind them remain little explored. Central to the concept of adversarial attacks on explainable AI (XAI) is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can result in erroneous conclusions on the efficacy of an XAI method: too sensitive a measure exaggerates a method's vulnerability, while too coarse a measure understates it. We investigate a variety of similarity measures designed for text-based ranked lists, including Kendall's Tau, Spearman's Footrule, and Rank-biased Overlap, to determine how substantial changes in the type of measure or the threshold of success affect the conclusions generated from common adversarial attack processes. Certain measures are found to be overly sensitive, resulting in erroneous estimates of stability.
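To make the comparison concrete, below is a minimal, hypothetical sketch (not code from the paper) of how two ranked feature-importance lists, one from an original input and one from a perturbed input, might be scored with the measures named above. Kendall's Tau is taken from SciPy; Spearman's Footrule and a truncated Rank-biased Overlap are written out from their standard definitions, and the feature lists themselves are invented for illustration.

```python
# Illustrative sketch only: hypothetical explanation rankings, not data from the paper.
from scipy.stats import kendalltau

original = ["awful", "plot", "boring", "acting", "slow"]    # ranked explanation before the attack
perturbed = ["plot", "awful", "acting", "slow", "boring"]   # ranked explanation after the attack

# Kendall's Tau over each feature's rank position in the two lists.
ranks_orig = [original.index(f) for f in original]
ranks_pert = [perturbed.index(f) for f in original]
tau, _ = kendalltau(ranks_orig, ranks_pert)

# Spearman's Footrule: total absolute displacement of each feature's rank.
footrule = sum(abs(original.index(f) - perturbed.index(f)) for f in original)

def rbo(a, b, p=0.9):
    """Truncated Rank-biased Overlap; smaller p weights the top of the list more heavily."""
    depth = min(len(a), len(b))
    score = sum(p ** (d - 1) * len(set(a[:d]) & set(b[:d])) / d for d in range(1, depth + 1))
    return (1 - p) * score

print(f"Kendall's Tau:       {tau:.3f}")
print(f"Spearman's Footrule: {footrule}")
print(f"RBO (p=0.9):         {rbo(original, perturbed):.3f}")
```

Because RBO weights agreement near the top of the list most heavily while Kendall's Tau and the Footrule treat all positions alike, the same pair of explanations can look stable under one measure and unstable under another, which is the sensitivity question the abstract raises.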
Related papers
- Poisoning Bayesian Inference via Data Deletion and Replication [0.4915744683251149]
We focus on extending the white-box model poisoning paradigm to attack generic Bayesian inference.
A suite of attacks is developed that allows an attacker to steer the Bayesian posterior toward a target distribution.
With relatively little effort, the attacker can substantively alter the Bayesian's beliefs and, by accepting more risk, mold these beliefs to their will.
arXiv Detail & Related papers (2025-03-06T14:30:15Z)
- Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods [0.0]
Local surrogate methods have been used to approximate the workings of complex machine learning models.
Recent work has revealed their vulnerability to adversarial attacks where the explanation produced is appreciably different.
Here we explore using an alternate search method with the goal of finding minimum viable perturbations.
arXiv Detail & Related papers (2025-01-15T18:45:05Z)
- Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI [9.31572645030282]
In adversarial attacks on explainable AI (XAI) in the NLP domain, the generated explanation is manipulated.
Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another.
This work investigates a variety of similarity measures designed for text-based ranked lists to determine their comparative suitability for use.
arXiv Detail & Related papers (2025-01-03T17:44:57Z)
- Detecting Adversarial Attacks in Semantic Segmentation via Uncertainty Estimation: A Deep Analysis [12.133306321357999]
We propose an uncertainty-based method for detecting adversarial attacks on neural networks for semantic segmentation.
We conduct a detailed analysis of uncertainty-based detection of adversarial attacks and various state-of-the-art neural networks.
Our numerical experiments show the effectiveness of the proposed uncertainty-based detection method.
arXiv Detail & Related papers (2024-08-19T14:13:30Z)
- Uncertainty in Additive Feature Attribution methods [34.80932512496311]
We focus on the class of additive feature attribution explanation methods.
We study the relationship between a feature's attribution and its uncertainty and observe little correlation.
We coin the term "stable instances" for such instances and diagnose factors that make an instance stable.
arXiv Detail & Related papers (2023-11-29T08:40:46Z)
- Causal Fair Metric: Bridging Causality, Individual Fairness, and Adversarial Robustness [7.246701762489971]
Adversarial perturbation, used to identify vulnerabilities in models, and individual fairness, aiming for equitable treatment of similar individuals, both depend on metrics to generate comparable input data instances.
Previous attempts to define such joint metrics often lack general assumptions about data or structural causal models and fail to reflect counterfactual proximity.
This paper introduces a causal fair metric formulated based on causal structures encompassing sensitive attributes and protected causal perturbation.
arXiv Detail & Related papers (2023-10-30T09:53:42Z)
- An Experimental Investigation into the Evaluation of Explainability Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z)
- A Practical Upper Bound for the Worst-Case Attribution Deviations [21.341303776931532]
Model attribution is a critical component of deep neural networks (DNNs), as it provides interpretability for these complex models.
Recent studies have drawn attention to the security of attribution methods, as they are vulnerable to attribution attacks that generate similar images with dramatically different attributions.
Existing works have investigated empirically improving the robustness of DNNs against those attacks; however, none of them explicitly quantifies the actual deviations of attributions.
In this work, for the first time, a constrained optimization problem is formulated to derive an upper bound that measures the largest dissimilarity of attributions after the samples are perturbed by any noise within a certain region.
arXiv Detail & Related papers (2023-03-01T09:07:27Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning [80.21709045433096]
A standard approach to adversarial robustness assumes a framework for defending against samples crafted by minimally perturbing an original sample.
We use metric learning to frame adversarial regularization as an optimal transport problem.
Our preliminary results indicate that regularizing over invariant perturbations in our framework improves defense against both invariance-based and sensitivity-based attacks.
arXiv Detail & Related papers (2022-11-04T13:54:02Z)
- Residual Error: a New Performance Measure for Adversarial Robustness [85.0371352689919]
A major challenge that limits the widespread adoption of deep learning has been its fragility to adversarial attacks.
This study presents the concept of residual error, a new performance measure for assessing the adversarial robustness of a deep neural network.
Experimental results using the case of image classification demonstrate the effectiveness and efficacy of the proposed residual error metric.
arXiv Detail & Related papers (2021-06-18T16:34:23Z)
- Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks [84.61578555312288]
We introduce a method for the prediction of disambiguation errors based on statistical data properties.
We develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors.
Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.
arXiv Detail & Related papers (2020-11-03T17:01:44Z)
- Metrics and methods for robustness evaluation of neural networks with generative models [0.07366405857677225]
Recently, especially in computer vision, researchers have discovered "natural" or "semantic" perturbations, such as rotations, changes of brightness, or other higher-level changes.
We propose several metrics to measure robustness of classifiers to natural adversarial examples, and methods to evaluate them.
arXiv Detail & Related papers (2020-03-04T10:58:59Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)