Brittle interpretations: The Vulnerability of TCAV and Other Concept-based Explainability Tools to Adversarial Attack
- URL: http://arxiv.org/abs/2110.07120v1
- Date: Thu, 14 Oct 2021 02:12:33 GMT
- Title: Brittle interpretations: The Vulnerability of TCAV and Other Concept-based Explainability Tools to Adversarial Attack
- Authors: Davis Brown and Henry Kvinge
- Abstract summary: Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning.
We show that these methods can suffer the same vulnerability to adversarial attacks as the models they are meant to analyze.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods for model explainability have become increasingly critical for
testing the fairness and soundness of deep learning. A number of explainability
techniques have been developed which use a set of examples to represent a
human-interpretable concept in a model's activations. In this work we show that
these explainability methods can suffer the same vulnerability to adversarial
attacks as the models they are meant to analyze. We demonstrate this phenomenon
on two well-known concept-based approaches to the explainability of deep
learning models: TCAV and faceted feature visualization. We show that by
carefully perturbing the examples of the concept that is being investigated, we
can radically change the output of the interpretability method, e.g. showing
that stripes are not an important factor in identifying images of a zebra. Our
work highlights the fact that in safety-critical applications, there is a need
for security around not only the machine learning pipeline but also the model
interpretation process.
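
To make the mechanism described in the abstract concrete, here is a minimal, self-contained sketch of the TCAV pipeline and of the attack surface it exposes: a concept activation vector (CAV) is fit from a set of concept examples, the TCAV score is the fraction of class inputs whose class logit increases along that vector, and perturbing the concept examples rotates the CAV and with it the reported score. Everything in the sketch is a stand-in (synthetic activations, a toy "zebra" logit, and a coarse activation-space perturbation rather than the image-space perturbations the paper actually crafts); real TCAV uses activations and gradients taken at a chosen layer of the model under test.

```python
# Illustrative sketch only: activations, the "zebra" logit, and the concept-set
# perturbation are all synthetic stand-ins for the real pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                       # width of the hypothetical layer
u = np.zeros(d)
u[0] = 1.0                   # hypothetical "stripes" feature direction

# Stand-in layer activations for concept ("stripes") images, random images,
# and zebra-class inputs.
concept_acts = rng.normal(size=(50, d)) + 2.0 * u
random_acts  = rng.normal(size=(50, d))
zebra_acts   = rng.normal(size=(100, d)) + 1.5 * u

def cav(pos, neg):
    """Concept activation vector: unit normal of a linear classifier that
    separates concept activations from random activations (the TCAV recipe)."""
    X = np.vstack([pos, neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    v = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return v / np.linalg.norm(v)

# Toy "zebra" logit f(a) = u.a + 0.05 * ||a||^2, so grad_a f = u + 0.1 * a.
def zebra_logit_grads(acts):
    return u + 0.1 * acts

def tcav_score(class_acts, v):
    """Fraction of class inputs whose zebra logit increases along the CAV."""
    return float(np.mean(zebra_logit_grads(class_acts) @ v > 0))

clean = tcav_score(zebra_acts, cav(concept_acts, random_acts))

# The attack surface: perturbing the *concept examples* rotates the CAV, which
# flips the sign of the directional derivatives and collapses the score, i.e.
# "stripes" suddenly looks unimportant for the zebra class.
poisoned_concept_acts = concept_acts - 4.0 * u
attacked = tcav_score(zebra_acts, cav(poisoned_concept_acts, random_acts))

print(f"TCAV score with clean concept set:     {clean:.2f}")     # ~1.0
print(f"TCAV score with perturbed concept set: {attacked:.2f}")  # ~0.0
```

Under these synthetic assumptions the clean concept set yields a score near 1.0 while the perturbed set drives it toward 0.0, mirroring the "stripes are not an important factor for zebras" reversal described above.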
Related papers
- Uncovering Unique Concept Vectors through Latent Space Decomposition [0.0]
Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates.
We propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training.
Our experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand.
arXiv Detail & Related papers (2023-07-13T17:21:54Z)
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
- Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning [16.13790238416691]
This work introduces two attacks, AdvEdge and AdvEdge+, that deceive both the target deep learning model and the coupled interpretation model.
Our analysis shows that these attacks are effective at deceiving both the deep learning models and their interpreters (a generic sketch of such a joint prediction-and-interpretation objective is given after this list).
arXiv Detail & Related papers (2022-11-29T04:45:10Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
- Explainability in Process Outcome Prediction: Guidelines to Obtain Interpretable and Faithful Models [77.34726150561087]
In the field of process outcome prediction, we define explainability through the interpretability of the explanations and the faithfulness of the explainability model.
This paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications.
arXiv Detail & Related papers (2022-03-30T05:59:50Z)
- When and How to Fool Explainable Models (and Humans) with Adversarial Examples [1.439518478021091]
We explore the possibilities and limits of adversarial attacks for explainable machine learning models.
First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios.
Next, we propose a comprehensive framework to study whether adversarial examples can be generated for explainable models.
arXiv Detail & Related papers (2021-07-05T11:20:55Z)
- Robust Semantic Interpretability: Revisiting Concept Activation Vectors [0.0]
Interpretability methods for image classification attempt to expose whether the model is systematically biased or attending to the same cues as a human would.
Our proposed Robust Concept Activation Vectors (RCAV) quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole.
arXiv Detail & Related papers (2021-04-06T20:14:59Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception [77.34726150561087]
We propose a methodology to evaluate the effect of distortions in explanations by embedding perceptual distances.
We generate explanations for images in the ImageNet-C dataset and demonstrate how using perceptual distances in the surrogate explainer creates more coherent explanations for the distorted and reference images.
arXiv Detail & Related papers (2021-02-22T12:38:53Z)
- Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
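
For the "Interpretations Cannot Be Trusted" entry above, the sketch below illustrates the general idea of a joint prediction-and-interpretation attack, not the paper's AdvEdge/AdvEdge+ method itself: a PGD-style perturbation pushes the classifier toward a wrong label while penalizing any drift in a gradient-based saliency map, so the explanation continues to look benign. The model, the saliency definition, the smooth nonlinearity, and all hyperparameters are illustrative assumptions.

```python
# Generic sketch of a joint prediction + interpretation attack objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in classifier; Softplus instead of ReLU so the saliency term below has
# usable second-order gradients.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 64), torch.nn.Softplus(),
    torch.nn.Linear(64, 10),
)

def saliency(x, label):
    """Gradient-of-logit saliency map, kept differentiable w.r.t. the input."""
    logit = model(x)[0, label]
    (grad,) = torch.autograd.grad(logit, x, create_graph=True)
    return grad.abs()

x_clean = torch.rand(1, 3, 32, 32)
true_label, target_label = 3, 7
clean_map = saliency(x_clean.clone().requires_grad_(True), true_label).detach()

delta = torch.zeros_like(x_clean, requires_grad=True)
eps, step, lam = 8 / 255, 1 / 255, 10.0

for _ in range(40):
    x_adv = (x_clean + delta).clamp(0, 1)
    # Push the prediction toward the wrong target label...
    attack_loss = F.cross_entropy(model(x_adv), torch.tensor([target_label]))
    # ...while keeping the saliency map close to the clean one.
    interp_loss = F.mse_loss(saliency(x_adv, true_label), clean_map)
    (grad,) = torch.autograd.grad(attack_loss + lam * interp_loss, delta)
    with torch.no_grad():
        delta -= step * grad.sign()     # signed gradient step (PGD)
        delta.clamp_(-eps, eps)         # stay within the L-infinity budget

x_adv = (x_clean + delta).detach().clamp(0, 1)
print("clean prediction:      ", model(x_clean).argmax().item())
print("adversarial prediction:", model(x_adv).argmax().item())
```

The weight lam trades off misclassification strength against explanation fidelity; the entry above reports that attacks of this general flavor are effective against both the model and its coupled interpreter.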