On the amplification of security and privacy risks by post-hoc
explanations in machine learning models
- URL: http://arxiv.org/abs/2206.14004v1
- Date: Tue, 28 Jun 2022 13:46:06 GMT
- Title: On the amplification of security and privacy risks by post-hoc
explanations in machine learning models
- Authors: Pengrui Quan, Supriyo Chakraborty, Jeya Vikranth Jeyakumar, Mani
Srivastava
- Abstract summary: Post-hoc explanation methods that highlight input dimensions according to their importance or relevance to the result also leak information that weakens security and privacy.
We propose novel explanation-guided black-box evasion attacks that lead to a tenfold reduction in query count for the same success rate.
We show that the adversarial advantage from explanations can be quantified as a reduction in the total variance of the estimated gradient.
- Score: 7.564511776742979
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A variety of explanation methods have been proposed in recent years to help
users gain insights into the results returned by neural networks, which are
otherwise complex and opaque black-boxes. However, explanations give rise to
potential side-channels that can be leveraged by an adversary for mounting
attacks on the system. In particular, post-hoc explanation methods that
highlight input dimensions according to their importance or relevance to the
result also leak information that weakens security and privacy. In this work,
we perform the first systematic characterization of the privacy and security
risks arising from various popular explanation techniques. First, we propose
novel explanation-guided black-box evasion attacks that lead to a tenfold
reduction in query count for the same success rate. We show that the
adversarial advantage from explanations can be quantified as a reduction in the
total variance of the estimated gradient. Second, we revisit the membership
information leaked by common explanations. Contrary to observations in prior
studies, our modified attacks show significant leakage of membership
information (more than a 100% improvement over prior results), even in a much
stricter black-box setting. Finally, we study explanation-guided model
extraction attacks and demonstrate adversarial gains through a large reduction
in query count.
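
To make the query-count and variance claims concrete, here is a minimal, hypothetical sketch of explanation-guided gradient estimation (not the authors' exact attack): a post-hoc saliency map picks the coordinates worth querying, so a finite-difference estimator spends its budget only where the explanation says the model is sensitive. The `query_loss` oracle, the saliency map, and both helper functions are stand-ins introduced for illustration.

```python
import numpy as np

def topk_salient_coords(saliency, k):
    """Indices of the k input dimensions the explanation marks as most important."""
    return np.argsort(np.abs(saliency).ravel())[-k:]

def finite_diff_gradient(query_loss, x, coords, eps=1e-2):
    """Coordinate-wise finite-difference gradient estimate restricted to `coords`.

    Querying only k explanation-selected coordinates instead of all d input
    dimensions cuts the budget per estimate from d + 1 to k + 1 queries."""
    grad = np.zeros_like(x, dtype=float)
    base = query_loss(x)                      # 1 query
    for i in coords:                          # k queries
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = eps
        grad.flat[i] = (query_loss(x + e) - base) / eps
    return grad

# Toy usage with a stand-in black-box loss and a stand-in saliency map.
d = 1000
rng = np.random.default_rng(0)
w = rng.normal(size=d)
query_loss = lambda x: float(w @ x)           # hypothetical victim loss oracle
x0 = rng.normal(size=d)
saliency = w + 0.1 * rng.normal(size=d)       # hypothetical post-hoc explanation
coords = topk_salient_coords(saliency, k=50)
g_hat = finite_diff_gradient(query_loss, x0, coords)
x_adv = x0 + 0.05 * np.sign(g_hat)            # one FGSM-style step on the estimate
```

Concentrating the query budget on explanation-selected coordinates is one intuitive route to the kind of variance reduction the abstract quantifies, but the sketch above is only a simplified illustration of the idea.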
Related papers
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes jailbreak vulnerability to reward misspecification during the alignment process.
We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness.
We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - GLiRA: Black-Box Membership Inference Attack via Knowledge Distillation [4.332441337407564]
We explore a connection between the susceptibility to membership inference attacks and the vulnerability to distillation-based functionality stealing attacks.
We propose GLiRA, a distillation-guided approach to membership inference attack on the black-box neural network.
We evaluate the proposed method across multiple image classification datasets and models and demonstrate that likelihood-ratio attacks, when guided by knowledge distillation, outperform the current state-of-the-art membership inference attacks in the black-box setting.
arXiv Detail & Related papers (2024-05-13T08:52:04Z) - Explaining Predictive Uncertainty by Exposing Second-Order Effects [13.83164409095901]
We present a new method for explaining predictive uncertainty based on second-order effects.
Our method is generally applicable, allowing common attribution techniques to be turned into powerful second-order uncertainty explainers.
arXiv Detail & Related papers (2024-01-30T21:02:21Z) - The privacy issue of counterfactual explanations: explanation linkage
attacks [0.0]
We introduce the explanation linkage attack, which can occur when deploying instance-based strategies to find counterfactual explanations.
To counter such an attack, we propose k-anonymous counterfactual explanations and introduce pureness as a new metric to evaluate the validity of these k-anonymous counterfactual explanations.
Our results show that making the explanations, rather than the whole dataset, k-anonymous is beneficial for the quality of the explanations.
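
A minimal sketch of what a k-anonymity check on a released counterfactual could look like, assuming tabular data and a known set of quasi-identifiers; the column names, helper function, and toy data are hypothetical and this is not the paper's algorithm.

```python
import pandas as pd

def is_k_anonymous(counterfactual, dataset, quasi_identifiers, k):
    """True iff at least k records share the counterfactual's quasi-identifier
    values, so the released explanation cannot be linked to fewer than k people."""
    mask = pd.Series(True, index=dataset.index)
    for col in quasi_identifiers:
        mask &= dataset[col] == counterfactual[col]
    return int(mask.sum()) >= k

# Hypothetical counterfactual for a credit-scoring model and a toy dataset.
data = pd.DataFrame({"age": [34, 35, 34, 34],
                     "zip": ["10001", "10001", "10001", "10001"],
                     "income": [40, 52, 47, 61]})
cf = {"age": 34, "zip": "10001", "income": 55}   # suggested change: raise income to 55
print(is_k_anonymous(cf, data, quasi_identifiers=["age", "zip"], k=3))  # True: 3 matching records
```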
arXiv Detail & Related papers (2022-10-21T15:44:19Z) - Protecting Split Learning by Potential Energy Loss [70.81375125791979]
We focus on the privacy leakage from the forward embeddings of split learning.
We propose the potential energy loss to make the forward embeddings more 'complicated'.
arXiv Detail & Related papers (2022-10-18T06:21:11Z) - Private Graph Extraction via Feature Explanations [0.7442906193848509]
We study the interplay of privacy and interpretability in graph machine learning through graph reconstruction attacks.
We show that additional knowledge of post-hoc feature explanations substantially increases the success rate of these attacks.
We propose a defense based on a randomized response mechanism for releasing the explanations, which substantially reduces the attack success rate.
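
As an illustration of the randomized-response idea mentioned above, here is a minimal sketch, assuming the explanation is released as a binary feature-importance mask; the per-bit formulation and function name are illustrative, not the paper's construction, which targets graph feature explanations.

```python
import numpy as np

def randomized_response_mask(mask, epsilon, rng=None):
    """Release a binary feature-importance mask under randomized response:
    each bit is kept with probability p = e^eps / (1 + e^eps) and flipped
    otherwise, giving epsilon-local differential privacy per bit."""
    rng = rng or np.random.default_rng()
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    flip = rng.random(mask.shape) >= p_keep
    return np.where(flip, 1 - mask, mask)

true_mask = np.array([1, 0, 0, 1, 1, 0])               # hypothetical top-feature indicator
noisy_mask = randomized_response_mask(true_mask, 1.0)  # what the explainer actually releases
```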
arXiv Detail & Related papers (2022-06-29T15:47:34Z) - Aurora Guard: Reliable Face Anti-Spoofing via Mobile Lighting System [103.5604680001633]
Anti-spoofing against high-resolution rendering replay of paper photos or digital videos remains an open problem.
We propose a simple yet effective face anti-spoofing system, termed Aurora Guard (AG).
arXiv Detail & Related papers (2021-02-01T09:17:18Z) - Local Black-box Adversarial Attacks: A Query Efficient Approach [64.98246858117476]
Adversarial attacks have threatened the application of deep neural networks in security-sensitive scenarios.
We propose a novel framework that perturbs only the discriminative areas of clean examples within a limited number of queries in black-box attacks.
We conduct extensive experiments to show that our framework significantly improves query efficiency during black-box perturbation while maintaining a high attack success rate.
arXiv Detail & Related papers (2021-01-04T15:32:16Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
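
A minimal sketch of this style of extraction under simplifying assumptions (a binary task, a toy linear target, a hypothetical counterfactual oracle, and scikit-learn for the surrogate); it is not the paper's algorithm, but it shows why counterfactuals, which sit just across the decision boundary, make cheap and informative training points for a copy of the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_surrogate(query_label, query_counterfactual, seeds):
    """Fit a surrogate on query points plus their counterfactuals; each
    counterfactual lies near the target's decision boundary and, on a binary
    task, carries the opposite label of its seed point."""
    X, y = [], []
    for x in seeds:
        yx = query_label(x)
        X.append(x); y.append(yx)
        cf = query_counterfactual(x)      # the target labels this point differently
        X.append(cf); y.append(1 - yx)
    return LogisticRegression().fit(np.array(X), np.array(y))

# Toy target: a linear classifier whose counterfactual oracle projects the
# input minimally across its own decision boundary.
w, b = np.array([1.0, -2.0]), 0.3
query_label = lambda x: int(x @ w + b > 0)
def query_counterfactual(x):
    margin = (x @ w + b) / (w @ w)
    return x - (margin + 1e-3 * np.sign(margin)) * w

rng = np.random.default_rng(1)
surrogate = extract_surrogate(query_label, query_counterfactual, rng.normal(size=(20, 2)))
```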
arXiv Detail & Related papers (2020-09-03T19:02:55Z) - Sampling Attacks: Amplification of Membership Inference Attacks by
Repeated Queries [74.59376038272661]
We introduce the sampling attack, a novel membership inference technique that, unlike other standard membership adversaries, works under the severe restriction of having no access to the scores of the victim model.
We show that a victim model that only publishes the labels is still susceptible to sampling attacks and the adversary can recover up to 100% of its performance.
For defense, we choose differential privacy in the form of gradient perturbation during the training of the victim model as well as output perturbation at prediction time.
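
A minimal sketch of the label-only intuition described above, assuming the adversary can only query hard labels from the victim; the scoring rule, noise scale, and threshold are illustrative and not the paper's exact estimator.

```python
import numpy as np

def label_stability_score(query_label, x, y, n_queries=50, sigma=0.1, rng=None):
    """Membership score from labels alone: the fraction of noisy copies of x
    that the victim still assigns the original label y. Training points tend
    to be classified more robustly, so a higher score suggests membership."""
    rng = rng or np.random.default_rng()
    hits = sum(query_label(x + sigma * rng.normal(size=x.shape)) == y
               for _ in range(n_queries))
    return hits / n_queries

# Membership is then decided by thresholding the score, e.g. with a cutoff
# calibrated on data the adversary knows to be non-members:
# is_member = label_stability_score(victim_predict, x_cand, y_cand) > 0.9
```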
arXiv Detail & Related papers (2020-09-01T12:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.