Inferring Sensitive Attributes from Model Explanations
- URL: http://arxiv.org/abs/2208.09967v1
- Date: Sun, 21 Aug 2022 21:31:19 GMT
- Title: Inferring Sensitive Attributes from Model Explanations
- Authors: Vasisht Duddu, Antoine Boutet
- Abstract summary: The dependency of explanations on the input raises privacy concerns for sensitive user data.
We design the first attribute inference attack against model explanations in two threat models.
We show that an adversary can accurately infer the value of sensitive attributes from explanations in both threat models.
- Score: 0.685316573653194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model explanations provide transparency into a trained machine learning
model's black-box behavior to a model builder. They indicate the influence of
different input attributes on the corresponding model prediction. The
dependency of explanations on the input raises privacy concerns for sensitive user
data. However, the current literature offers limited discussion of the privacy
risks of model explanations.
We focus on the specific privacy risk of attribute inference, wherein
an adversary infers sensitive attributes of an input (e.g., race and sex) given
its model explanations. We design the first attribute inference attack against
model explanations in two threat models, where the model builder either (a) includes
the sensitive attributes in the training data and input or (b) censors the
sensitive attributes by excluding them from the training data and input.
We evaluate our proposed attack on four benchmark datasets and four
state-of-the-art algorithms. We show that an adversary can accurately infer
the value of sensitive attributes from explanations in both threat models.
Moreover, the attack succeeds even when exploiting only the explanations
corresponding to the sensitive attributes. These results suggest that our
attack is effective against explanations and poses a practical threat to data
privacy.
When combining model predictions (an attack surface exploited by prior
attacks) with explanations, we find that the attack success does not improve.
Moreover, exploiting model explanations yields higher attack success than
exploiting model predictions alone. These results suggest that model
explanations are a strong attack surface for an adversary to exploit.
Related papers
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z) - When Machine Learning Models Leak: An Exploration of Synthetic Training Data [0.0]
We investigate an attack on a machine learning model that predicts whether a person or household will relocate in the next two years.
The attack assumes that the attacker can query the model to obtain predictions and that the marginal distribution of the data on which the model was trained is publicly available.
We explore how replacing the original data with synthetic data when training the model impacts how successfully the attacker can infer sensitive attributes.
arXiv Detail & Related papers (2023-10-12T23:47:22Z) - Analyzing the Impact of Adversarial Examples on Explainable Machine Learning [0.31498833540989407]
Adversarial attacks are a type of attack on machine learning models where an attacker deliberately modifies the inputs to cause the model to make incorrect predictions.
Work on the vulnerability of deep learning models to adversarial attacks has shown that it is easy to craft samples that cause a model to make incorrect predictions.
In this work, we analyze the impact of adversarial attacks on model interpretability for text classification problems.
arXiv Detail & Related papers (2023-07-17T08:50:36Z) - Can Adversarial Examples Be Parsed to Reveal Victim Model Information? [62.814751479749695]
In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information from data-specific adversarial instances.
We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models.
We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks.
arXiv Detail & Related papers (2023-03-13T21:21:49Z) - MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - Are Your Sensitive Attributes Private? Novel Model Inversion Attribute Inference Attacks on Classification Models [22.569705869469814]
We focus on model inversion attacks where the adversary knows non-sensitive attributes about records in the training data.
We devise a novel confidence score-based model inversion attribute inference attack that significantly outperforms the state-of-the-art.
We also extend our attacks to the scenario where some of the other (non-sensitive) attributes of a target record are unknown to the adversary.
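A confidence score-based attribute inference attack of this kind can be sketched in a few lines. Everything below is a toy stand-in: the `target_confidence` function, its 0.3 confidence bonus, and the noise level are invented assumptions used only to illustrate the enumerate-and-pick-the-max strategy, not the paper's actual attack or target models.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the target model's confidence score. Hypothetical
# behaviour: the model is more confident when queried with the sensitive
# attribute value it actually saw during training for that record.
def target_confidence(non_sensitive, candidate, true_sensitive):
    base = 0.5 + 0.1 * non_sensitive
    bonus = 0.3 if candidate == true_sensitive else 0.0
    return min(1.0, base + bonus + rng.normal(0.0, 0.02))

# Attack sketch: the adversary knows the non-sensitive attributes, tries
# every candidate value of the sensitive attribute, and keeps the one
# that maximises the returned confidence.
def infer_sensitive(non_sensitive, true_sensitive, candidates=(0, 1)):
    scores = [target_confidence(non_sensitive, c, true_sensitive)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

records = [(rng.uniform(), int(rng.integers(0, 2))) for _ in range(200)]
hits = sum(infer_sensitive(x, s) == s for x, s in records)
accuracy = hits / len(records)
```

The design choice is that the attack needs only black-box confidence scores, no gradients or model internals, which is what makes it applicable to deployed classification APIs.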
arXiv Detail & Related papers (2022-01-23T21:27:20Z) - Property Inference Attacks on Convolutional Neural Networks: Influence and Implications of Target Model's Complexity [1.2891210250935143]
Property Inference Attacks aim to infer from a given model properties about the training dataset seemingly unrelated to the model's primary goal.
This paper investigates the influence of the target model's complexity on the accuracy of this type of attack.
Our findings reveal that the risk of a privacy breach is present independently of the target model's complexity.
arXiv Detail & Related papers (2021-04-27T09:19:36Z) - Black-box Model Inversion Attribute Inference Attacks on Classification Models [32.757792981935815]
We focus on one kind of model inversion attacks, where the adversary knows non-sensitive attributes about instances in the training data.
We devise two novel model inversion attribute inference attacks -- confidence modeling-based attack and confidence score-based attack.
We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained with two real datasets.
arXiv Detail & Related papers (2020-12-07T01:14:19Z) - Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations [81.82518920087175]
Adversarial attacks aim to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z) - Model extraction from counterfactual explanations [68.8204255655161]
We show how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks.
Our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations.
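Why counterfactual explanations enable extraction can be seen in a minimal sketch. The one-dimensional threshold model, the `counterfactual` oracle, and the midpoint estimate below are all toy assumptions for illustration; the cited attack targets real classifiers, but the intuition is the same: each counterfactual lies just across the decision boundary, so a handful of queries pins the boundary down.

```python
import numpy as np

rng = np.random.default_rng(2)

TRUE_THRESHOLD = 0.37  # hidden parameter of the toy target model

def target_predict(x):
    return int(x >= TRUE_THRESHOLD)

def counterfactual(x, step=1e-3):
    # Stand-in for an explanation API: returns the minimal change to x
    # that flips the target's prediction.
    return TRUE_THRESHOLD + step if x < TRUE_THRESHOLD else TRUE_THRESHOLD - step

# Extraction sketch: every counterfactual sits next to the decision
# boundary, so averaging them recovers the hidden threshold.
queries = rng.uniform(0, 1, size=20)
boundary_points = [counterfactual(x) for x in queries]
surrogate_threshold = float(np.mean(boundary_points))

def surrogate_predict(x):
    return int(x >= surrogate_threshold)

# Fidelity: how often the extracted surrogate agrees with the target.
test_points = rng.uniform(0, 1, size=1000)
fidelity = float(np.mean([surrogate_predict(x) == target_predict(x)
                          for x in test_points]))
```

The surrogate disagrees with the target only inside the narrow band between the two thresholds, which is why counterfactual access yields high-fidelity copies with few queries.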
arXiv Detail & Related papers (2020-09-03T19:02:55Z) - Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little knowledge of the attacked model as possible.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.