Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning
- URL: http://arxiv.org/abs/2503.08636v1
- Date: Tue, 11 Mar 2025 17:24:33 GMT
- Title: Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning
- Authors: Hubert Baniecki, Przemyslaw Biecek
- Abstract summary: We highlight the risks related to overreliance and susceptibility to adversarial manipulation of "intrinsically" interpretable models by design. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks. The reported limitations of prototype-based networks put their trustworthiness and applicability into question.
- Score: 9.769695768744421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of prototype-based networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
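To make the prototype-manipulation strategy above concrete, the following is a minimal sketch (an assumption-laden illustration, not the paper's code) of how an input could be nudged, within a small perturbation budget, toward maximal similarity with an arbitrary latent prototype of a ProtoPNet-style network. The toy architecture, hyperparameters, and names (ToyProtoNet, push_towards_prototype) are hypothetical.

```python
# Hypothetical sketch of latent-prototype manipulation, not the authors' implementation.
import torch
import torch.nn as nn

class ToyProtoNet(nn.Module):
    """Tiny prototype-based classifier: conv features -> prototype similarities -> logits."""
    def __init__(self, n_prototypes=10, n_classes=5, feat_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, feat_dim))
        self.classifier = nn.Linear(n_prototypes, n_classes)

    def similarities(self, x):
        z = self.backbone(x).flatten(1)                    # (B, feat_dim)
        d = torch.cdist(z, self.prototypes)                # distance to each latent prototype
        return torch.log((d ** 2 + 1) / (d ** 2 + 1e-4))   # ProtoPNet-style log similarity

    def forward(self, x):
        return self.classifier(self.similarities(x))

def push_towards_prototype(model, x, proto_idx, eps=8 / 255, steps=40, alpha=2 / 255):
    """PGD-style perturbation that maximizes similarity to one chosen prototype
    while staying inside an L-infinity ball, i.e. visually (near-)unchanged."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        sim = model.similarities(x_adv)[:, proto_idx].sum()
        grad = torch.autograd.grad(sim, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()       # gradient ascent on similarity
        x_adv = x + (x_adv - x).clamp(-eps, eps)           # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv

if __name__ == "__main__":
    model = ToyProtoNet().eval()
    x = torch.rand(1, 3, 32, 32)                           # stand-in for a real image
    x_adv = push_towards_prototype(model, x, proto_idx=3)
    print("clean similarity to prototype 3:", model.similarities(x)[0, 3].item())
    print("adv   similarity to prototype 3:", model.similarities(x_adv)[0, 3].item())
```

The point mirrors the abstract's claim: the perturbed input looks unchanged to a human, yet the prototype-similarity evidence the model presents as its "reasoning" can be steered toward an unrelated prototype.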
Related papers
- Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
- A Survey on Transferability of Adversarial Examples across Deep Neural Networks [53.04734042366312]
Adversarial examples can manipulate machine learning models into making erroneous predictions.
The transferability of adversarial examples enables black-box attacks that circumvent the need for detailed knowledge of the target model (a minimal illustration appears after this list).
This survey explores the landscape of the transferability of adversarial examples.
arXiv Detail & Related papers (2023-10-26T17:45:26Z)
- Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning [16.13790238416691]
This work introduces two attacks, AdvEdge and AdvEdge+, that deceive both the target deep learning model and the coupled interpretation model.
Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters.
arXiv Detail & Related papers (2022-11-29T04:45:10Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
- Brittle interpretations: The Vulnerability of TCAV and Other Concept-based Explainability Tools to Adversarial Attack [0.0]
Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning.
We show that these methods can suffer the same vulnerability to adversarial attacks as the models they are meant to analyze.
arXiv Detail & Related papers (2021-10-14T02:12:33Z)
- Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z)
- Detection Defense Against Adversarial Attacks with Saliency Map [7.736844355705379]
It is well established that neural networks are vulnerable to adversarial examples, which are almost imperceptible to human vision.
Existing defenses tend to harden the robustness of models against adversarial attacks.
We propose a novel detection method that combines additional noise with an inconsistency strategy to identify adversarial examples.
arXiv Detail & Related papers (2020-09-06T13:57:17Z)
- Proper Network Interpretability Helps Adversarial Robustness in Classification [91.39031895064223]
We show that with a proper measurement of interpretation, it is difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy.
We develop an interpretability-aware defensive scheme built only on promoting robust interpretation.
We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation.
arXiv Detail & Related papers (2020-06-26T01:31:31Z)
- Adversarial Attacks and Defenses: An Interpretation Perspective [80.23908920686625]
We review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation.
The goal of model interpretation, or interpretable machine learning, is to describe the working mechanism of models in human-understandable terms.
For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses.
arXiv Detail & Related papers (2020-04-23T23:19:00Z)
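As a concrete illustration of the transferability theme recurring in the entries above (referenced from the transferability survey), here is a minimal, self-contained sketch: an FGSM example crafted against a surrogate model is evaluated on an independently trained target model. The synthetic data, toy models, and hyperparameters are assumptions for illustration only, not code from any of the listed papers.

```python
# Hypothetical sketch of adversarial transferability with synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, x, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def fgsm(model, x, y, eps=0.5):
    """One-step attack: perturb x along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# Synthetic binary-classification data (assumption: stands in for a real dataset).
x = torch.randn(512, 20)
y = (x[:, 0] + x[:, 1] > 0).long()

surrogate = train(make_model(), x, y)   # attacker's model (white-box access)
target = train(make_model(), x, y)      # victim's model (black-box to the attacker)

x_adv = fgsm(surrogate, x, y)           # examples crafted on the surrogate only
acc = lambda m, inp: (m(inp).argmax(1) == y).float().mean().item()
print(f"target accuracy, clean: {acc(target, x):.2f}  adversarial: {acc(target, x_adv):.2f}")
```

Because the two models learn similar decision boundaries, perturbations computed against the surrogate typically also degrade the independently trained target, which is the mechanism that makes black-box attacks practical.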
This list is automatically generated from the titles and abstracts of the papers in this site.