Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI
- URL: http://arxiv.org/abs/2501.02042v1
- Date: Fri, 03 Jan 2025 17:44:57 GMT
- Title: Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI
- Authors: Christopher Burger, Charles Walter, Thai Le, Lingwei Chen
- Abstract summary: In adversarial attacks on explainable AI (XAI) in the NLP domain, the generated explanation is manipulated.
Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another.
This work investigates a variety of similarity measures designed for text-based ranked lists to determine their comparative suitability for use.
- Score: 9.31572645030282
- License:
- Abstract: Recent work has investigated the concept of adversarial attacks on explainable AI (XAI) in the NLP domain with a focus on examining the vulnerability of local surrogate methods such as Lime to adversarial perturbations or small changes to the input of a machine learning (ML) model. In such attacks, the generated explanation is manipulated while the meaning and structure of the original input remain similar under the ML model. Such attacks are especially alarming when XAI is used as a basis for decision making (e.g., prescribing drugs based on AI medical predictors) or for legal action (e.g., legal disputes involving AI software). Although weaknesses across many XAI methods have been shown to exist, the reasons why remain little explored. Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can lead to erroneous conclusions about the stability or adversarial robustness of an XAI method. Therefore, this work investigates a variety of similarity measures designed for text-based ranked lists referenced in related work to determine their comparative suitability for use. We find that many measures are overly sensitive, resulting in erroneous estimates of stability. We then propose a weighting scheme for text-based data that incorporates the synonymity between the features within an explanation, providing more accurate estimates of the actual vulnerability of XAI methods to adversarial examples.
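To illustrate why the choice of measure matters, the following Python sketch compares two LIME-style explanations, represented as ranked lists of tokens, under a plain set-overlap measure and under a hypothetical synonymity-weighted variant. The weighting scheme, the SYNONYMITY table, and the example token lists are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: the concrete weighting scheme, the SYNONYMITY table,
# and the example explanations below are assumptions for demonstration, not the
# paper's actual method.

def jaccard_top_k(a, b, k=5):
    """Unweighted set overlap of the top-k features (a common baseline measure)."""
    sa, sb = set(a[:k]), set(b[:k])
    return len(sa & sb) / len(sa | sb)

# Hypothetical token-to-token synonymity scores; a real system might instead use
# WordNet relations or embedding cosine similarity.
SYNONYMITY = {
    ("awful", "terrible"): 0.9,
    ("movie", "film"): 0.95,
}

def syn(u, v):
    """Symmetric synonymity lookup, with exact matches scoring 1.0."""
    if u == v:
        return 1.0
    return SYNONYMITY.get((u, v), SYNONYMITY.get((v, u), 0.0))

def synonymity_weighted_overlap(a, b, k=5):
    """Soft overlap: each top-k feature of `a` is credited by its best
    synonymity match in `b`, so swapping a word for a near-synonym is not
    treated as a wholesale change to the explanation."""
    a_k, b_k = a[:k], b[:k]
    return sum(max(syn(u, v) for v in b_k) for u in a_k) / k

original  = ["terrible", "plot", "acting", "boring", "film"]   # explanation of the clean input
perturbed = ["awful", "plot", "acting", "boring", "movie"]     # only synonym swaps

print(jaccard_top_k(original, perturbed))                # ~0.43: looks unstable
print(synonymity_weighted_overlap(original, perturbed))  # ~0.97: nearly identical
```

The unweighted measure treats synonym swaps as a large change to the explanation, whereas the weighted variant reports the two explanations as nearly identical, which is the kind of correction to stability estimates the abstract argues for.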
Related papers
- Improving Robustness Estimates in Natural Language Explainable AI through Synonymity Weighted Similarity Measures [0.0]
Adversarial examples have been prominent in the literature surrounding the effectiveness of XAI.
For explanations in natural language, it is natural to adopt ranked-list measures from the domain of information retrieval.
We show that the standard implementations of these measures are poorly suited for the comparison of explanations in adversarial XAI.
arXiv Detail & Related papers (2025-01-02T19:49:04Z)
- F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI [15.314388210699443]
Fine-tuned Fidelity (F-Fidelity) is a robust evaluation framework for XAI.
We show that F-Fidelity significantly improves upon prior evaluation metrics in recovering the ground-truth ranking of explainers.
We also show that, given a faithful explainer, the F-Fidelity metric can be used to compute the sparsity of influential input components.
arXiv Detail & Related papers (2024-10-03T20:23:06Z)
- The Effect of Similarity Measures on Accurate Stability Estimates for Local Surrogate Models in Text-based Explainable AI [8.23094630594374]
A poor choice of similarity measure can lead to erroneous conclusions on the efficacy of an XAI method.
We investigate a variety of similarity measures designed for text-based ranked lists, including Kendall's Tau, Spearman's Footrule, and Rank-biased Overlap; a minimal sketch of these measures appears after this list.
arXiv Detail & Related papers (2024-06-22T12:59:12Z)
- Are Objective Explanatory Evaluation metrics Trustworthy? An Adversarial Analysis [12.921307214813357]
The paper proposes a novel explanatory technique called SHifted Adversaries using Pixel Elimination (SHAPE).
We show that SHAPE is, in fact, an adversarial explanation that fools the causal metrics employed to measure the robustness and reliability of popular importance-based visual XAI methods.
arXiv Detail & Related papers (2024-06-12T02:39:46Z)
- Adversarial attacks and defenses in explainable artificial intelligence: A survey [11.541601343587917]
Recent advances in adversarial machine learning (AdvML) highlight the limitations and vulnerabilities of state-of-the-art explanation methods.
This survey provides a comprehensive overview of research concerning adversarial attacks on explanations of machine learning models.
arXiv Detail & Related papers (2023-06-06T09:53:39Z)
- An Experimental Investigation into the Evaluation of Explainability Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), a method based on adversarial attacks that provides semantically meaningful adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, ADDMU, which combines two types of uncertainty estimation to detect both regular and far-boundary (FB) adversarial examples.
Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit, OpenBackdoor, to foster the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks [84.61578555312288]
We introduce a method for the prediction of disambiguation errors based on statistical data properties.
We develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors.
Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.
arXiv Detail & Related papers (2020-11-03T17:01:44Z)
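As referenced in the entry above on the effect of similarity measures, the ranked-list measures it names (Kendall's Tau, Spearman's Footrule, and Rank-biased Overlap) can be sketched in a few lines of Python. This is a minimal illustration, assuming both explanations rank the same feature set without ties and using the truncated (non-extrapolated) form of RBO; vetted implementations such as scipy.stats.kendalltau should be preferred in practice.

```python
# A minimal sketch, assuming both explanations rank the same feature set and
# contain no ties; production code should prefer vetted implementations
# (e.g., scipy.stats.kendalltau for Kendall's Tau).

def kendall_tau(r1, r2):
    """Kendall's Tau: 1 - 4 * (discordant pairs) / (n * (n - 1))."""
    pos2 = {x: i for i, x in enumerate(r2)}
    n = len(r1)
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if pos2[r1[i]] > pos2[r1[j]]  # r2 reverses the order r1 imposes on this pair
    )
    return 1 - 4 * discordant / (n * (n - 1))

def spearman_footrule(r1, r2):
    """Spearman's Footrule: total absolute rank displacement (0 = identical order)."""
    pos2 = {x: i for i, x in enumerate(r2)}
    return sum(abs(i - pos2[x]) for i, x in enumerate(r1))

def rbo(r1, r2, p=0.9):
    """Rank-biased Overlap, truncated (non-extrapolated) form: top-weighted
    agreement in [0, 1], where p controls how heavily the top ranks count."""
    depth = max(len(r1), len(r2))
    score = sum(
        p ** (d - 1) * len(set(r1[:d]) & set(r2[:d])) / d
        for d in range(1, depth + 1)
    )
    return (1 - p) * score

# Two explanations that rank the same five features in slightly different orders.
e1 = ["terrible", "plot", "acting", "boring", "film"]
e2 = ["plot", "terrible", "acting", "film", "boring"]
print(kendall_tau(e1, e2))        # 0.6
print(spearman_footrule(e1, e2))  # 4
print(rbo(e1, e2))                # ~0.29 (truncated form, so an underestimate)
```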