Fooling SHAP with Output Shuffling Attacks
- URL: http://arxiv.org/abs/2408.06509v1
- Date: Mon, 12 Aug 2024 21:57:18 GMT
- Title: Fooling SHAP with Output Shuffling Attacks
- Authors: Jun Yuan, Aritra Dasgupta
- Abstract summary: Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models.
However, adversarial attacks can subvert the detection of XAI methods.
We propose a novel family of attacks, called shuffling attacks, that are data-agnostic.
- Score: 4.873272103738719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.
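As a rough, hypothetical illustration of the setting the abstract describes (not the paper's attack construction or evaluation code), the sketch below trains a simple model on synthetic data with a "protected" feature, wraps its predictions in a crude output-shuffling function, and compares Shapley-value estimates for the protected feature before and after. The function names (`honest_predict`, `shuffled_predict`) and the synthetic data are invented for this example, and Kernel SHAP from the `shap` library is used here only as one standard Shapley-value estimator.

```python
# Minimal sketch, assuming a scikit-learn model and the shap library.
# This is NOT the paper's shuffling attack; it only gestures at the idea of
# decoupling a model's outputs from its inputs before a Shapley estimator runs.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: column 0 plays the role of the "protected" feature and is
# deliberately correlated with the label. Entirely invented for illustration.
n = 500
protected = rng.integers(0, 2, size=n)
other = rng.normal(size=(n, 3))
X = np.column_stack([protected, other])
y = (0.8 * protected + other[:, 0] + 0.1 * rng.normal(size=n) > 0.5).astype(int)

model = LogisticRegression().fit(X, y)

def honest_predict(data):
    # Unmodified model probabilities.
    return model.predict_proba(data)[:, 1]

def shuffled_predict(data):
    # Toy stand-in for an output-shuffling adversary: permute the model's
    # outputs across the batch so each score no longer lines up with its own
    # input row, while the marginal output distribution stays the same.
    scores = model.predict_proba(data)[:, 1]
    return rng.permutation(scores)

background = shap.sample(X, 50)   # background set for the Kernel SHAP estimator
points = X[:25]                   # points to explain

for name, fn in [("honest", honest_predict), ("shuffled", shuffled_predict)]:
    explainer = shap.KernelExplainer(fn, background)
    sv = explainer.shap_values(points, nsamples=100)
    print(f"{name:8s} mean |attribution| of protected feature:",
          float(np.abs(sv[:, 0]).mean()))
```

Numbers from such a toy run say nothing about the paper's results; the point is only to show where an output-shuffling wrapper sits relative to a Shapley-value estimator in code.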
Related papers
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- DTA: Distribution Transform-based Attack for Query-Limited Scenario [11.874670564015789]
In generating adversarial examples, conventional black-box attack methods rely on sufficient feedback from the target models.
This paper proposes a hard-label attack for the setting where the attacker is permitted only a limited number of queries.
Experiments validate the effectiveness of the proposed idea and the superiority of DTA over the state-of-the-art.
arXiv Detail & Related papers (2023-12-12T13:21:03Z)
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR).
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z)
- Can Adversarial Examples Be Parsed to Reveal Victim Model Information? [62.814751479749695]
In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information from data-specific adversarial instances.
We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models.
We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks.
arXiv Detail & Related papers (2023-03-13T21:21:49Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques.
We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z)
- Anomaly Detection-Based Unknown Face Presentation Attack Detection [74.4918294453537]
Anomaly detection-based spoof attack detection is a recent development in face Presentation Attack Detection.
In this paper, we present a deep-learning solution for anomaly detection-based spoof attack detection.
The proposed approach benefits from the representation learning power of CNNs and learns better features for the face presentation attack detection (fPAD) task.
arXiv Detail & Related papers (2020-07-11T21:20:55Z)
- Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning [71.17774313301753]
We explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks.
Experimental results on the ASVspoof 2019 dataset demonstrate that high-level representations extracted by Mockingjay can prevent the transferability of adversarial examples.
arXiv Detail & Related papers (2020-06-05T03:03:06Z)
- Feature Partitioning for Robust Tree Ensembles and their Certification in Adversarial Scenarios [8.300942601020266]
We focus on evasion attacks, where a model is trained in a safe environment and exposed to attacks at test time.
We propose a model-agnostic strategy that builds a robust ensemble by training its basic models on feature-based partitions of the given dataset.
Our algorithm guarantees that the majority of the models in the ensemble cannot be affected by the attacker.
arXiv Detail & Related papers (2020-04-07T12:00:40Z)
- Adversarial Detection and Correction by Matching Prediction Distributions [0.0]
The detector almost completely neutralises powerful attacks like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST.
We show that our method is still able to detect the adversarial examples in the case of a white-box attack where the attacker has full knowledge of both the model and the defence.
arXiv Detail & Related papers (2020-02-21T15:45:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.