Rethinking Privacy in Machine Learning Pipelines from an Information
Flow Control Perspective
- URL: http://arxiv.org/abs/2311.15792v1
- Date: Mon, 27 Nov 2023 13:14:39 GMT
- Title: Rethinking Privacy in Machine Learning Pipelines from an Information
Flow Control Perspective
- Authors: Lukas Wutschitz, Boris Köpf, Andrew Paverd, Saravan Rajmohan, Ahmed
Salem, Shruti Tople, Santiago Zanella-Béguelin, Menglin Xia, Victor Rühle
- Abstract summary: Modern machine learning systems use models trained on ever-growing corpora.
Metadata such as ownership, access control, or licensing information is ignored during training.
We take an information flow control perspective to describe machine learning systems.
- Score: 16.487545258246932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning systems use models trained on ever-growing corpora.
Typically, metadata such as ownership, access control, or licensing information
is ignored during training. Instead, to mitigate privacy risks, we rely on
generic techniques such as dataset sanitization and differentially private
model training, with inherent privacy/utility trade-offs that hurt model
performance. Moreover, these techniques have limitations in scenarios where
sensitive information is shared across multiple participants and fine-grained
access control is required. By ignoring metadata, we therefore miss an
opportunity to better address security, privacy, and confidentiality
challenges. In this paper, we take an information flow control perspective to
describe machine learning systems, which allows us to leverage metadata such as
access control policies and define clear-cut privacy and confidentiality
guarantees with interpretable information flows. Under this perspective, we
contrast two different approaches to achieve user-level non-interference: 1)
fine-tuning per-user models, and 2) retrieval augmented models that access
user-specific datasets at inference time. We compare these two approaches to a
trivially non-interfering zero-shot baseline using a public model and to a
baseline that fine-tunes this model on the whole corpus. We evaluate trained
models on two datasets of scientific articles and demonstrate that retrieval
augmented architectures deliver the best utility, scalability, and flexibility
while satisfying strict non-interference guarantees.
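As an illustration of the retrieval-augmented approach described in the abstract, the sketch below (not the authors' implementation; the document store, scoring function, and the `generate` placeholder are all assumptions) shows how access-control metadata can be enforced at retrieval time so that a response for one user never depends on another user's private documents.

```python
# Minimal sketch (not the paper's implementation) of retrieval augmentation with
# access-control filtering at inference time. The base model is trained only on
# public data; user-specific documents enter the pipeline solely through retrieval,
# and the retriever only ever scores documents the querying user is authorized to
# read, so the output for user u cannot depend on other users' private data
# (user-level non-interference).

from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_readers: frozenset  # access-control metadata kept alongside the content

CORPUS = [
    Document("d1", "alice's notes on transformer fine-tuning", frozenset({"alice"})),
    Document("d2", "bob's draft on homomorphic encryption", frozenset({"bob"})),
    Document("d3", "public survey of differential privacy", frozenset({"*"})),
]

def score(query: str, doc: Document) -> int:
    # Toy relevance score: word overlap. A real system would use BM25 or dense retrieval.
    return len(set(query.lower().split()) & set(doc.text.lower().split()))

def retrieve(query: str, user: str, k: int = 2):
    # Enforce the policy *before* scoring: documents the user cannot read are never touched.
    visible = [d for d in CORPUS if user in d.allowed_readers or "*" in d.allowed_readers]
    return sorted(visible, key=lambda d: score(query, d), reverse=True)[:k]

def generate(prompt: str) -> str:
    # Placeholder for a call to a model trained only on public data.
    return f"[model output conditioned on: {prompt[:60]}...]"

def answer(query: str, user: str) -> str:
    context = "\n".join(d.text for d in retrieve(query, user))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(answer("what do we know about differential privacy?", user="alice"))
```

Because the base model sees only public data at training time and the retriever filters by policy before scoring, each user's output is a function only of public data and that user's accessible documents, which is the kind of non-interference guarantee the paper formalizes.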
Related papers
- Game-Theoretic Machine Unlearning: Mitigating Extra Privacy Leakage [12.737028324709609]
Recent legislation obligates organizations to remove requested data and its influence from a trained model.
We propose a game-theoretic machine unlearning algorithm that simulates the competitive relationship between unlearning performance and privacy protection.
arXiv Detail & Related papers (2024-11-06T13:47:04Z)
- Verification of Machine Unlearning is Fragile [48.71651033308842]
We introduce two novel adversarial unlearning processes capable of circumventing both types of verification strategies.
This study highlights the vulnerabilities and limitations in machine unlearning verification, paving the way for further research into the safety of machine unlearning.
arXiv Detail & Related papers (2024-08-01T21:37:10Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- Privacy Side Channels in Machine Learning Systems [87.53240071195168]
We introduce privacy side channels: attacks that exploit system-level components to extract private information.
For example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees.
We further show that systems which block language models from regenerating training data can be exploited to exfiltrate private keys contained in the training set.
arXiv Detail & Related papers (2023-09-11T16:49:05Z)
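To make the deduplication example above concrete, here is a toy simulation; the "drop-all" deduplication policy and the record names are illustrative assumptions, not the attack code from the paper. The point is that one record's membership changes whether a planted copy survives deduplication, so the datasets actually trained on can differ in more than one record, which is precisely the neighbouring-dataset assumption that the differential privacy accounting relies on.

```python
# Toy illustration (an assumption-laden sketch, not the cited paper's attack) of why
# deduplication before differentially private training can act as a side channel.
# Assumed pipeline: "drop-all" dedup, which removes every record occurring more than once.

from collections import Counter

def drop_all_dedup(records):
    # Assumed dedup policy: remove every record that appears more than once.
    counts = Counter(records)
    return [r for r in records if counts[r] == 1]

target = "secret record"
adversary_plant = [target]  # the adversary contributes one copy of the target

for target_is_member in (True, False):
    victim_data = ["doc a", "doc b"] + ([target] if target_is_member else [])
    training_set = drop_all_dedup(victim_data + adversary_plant)
    print(f"target in victim data: {target_is_member} -> "
          f"planted copy trains the model: {target in training_set}")

# The planted copy is trained on exactly when the target was NOT already a member,
# so observing its influence on the model reveals membership, regardless of the DP
# budget accounted for on the deduplicated dataset.
```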
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many of the predictive signals in the data may instead come from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
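For readers unfamiliar with the term, the sketch below shows a generic hybrid discriminative-generative setup: one encoder shared by a reconstruction decoder and a classifier head. It is only an assumption-level illustration of the general recipe and omits the nuisance-extended information-bottleneck term that is the paper's actual contribution.

```python
# Generic sketch of hybrid discriminative-generative training with an autoencoder:
# one encoder feeds both a decoder (reconstruction loss) and a classifier head
# (prediction loss). This illustrates the general recipe only; the cited paper's
# objective additionally involves a nuisance-extended information-bottleneck term
# that is not reproduced here.

import torch
import torch.nn as nn

class HybridAutoencoder(nn.Module):
    def __init__(self, in_dim=20, latent_dim=8, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = HybridAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 20)          # dummy inputs
y = torch.randint(0, 2, (64,))   # dummy labels

for step in range(100):
    x_hat, logits = model(x)
    # Generative (reconstruction) and discriminative (classification) terms share the encoder.
    loss = nn.functional.mse_loss(x_hat, x) + nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```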
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Tight Auditing of Differentially Private Machine Learning [77.38590306275877]
For private machine learning, existing auditing mechanisms are not tight: they only give tight estimates under implausible worst-case assumptions.
We design an improved auditing scheme that yields tight privacy estimates for natural (not adversarially crafted) datasets.
arXiv Detail & Related papers (2023-02-15T21:40:33Z)
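A common recipe for such empirical auditing (not necessarily the improved scheme of this paper) is to run a membership-inference attack over many train/no-train trials and convert its true- and false-positive rates into a lower bound on epsilon, using the fact that (ε, δ)-DP implies TPR ≤ e^ε · FPR + δ:

```python
# Minimal sketch of empirical DP auditing (a standard recipe, not necessarily the
# cited paper's scheme): turn a membership-inference attack's true- and false-positive
# rates into a lower bound on epsilon, since (eps, delta)-DP implies
# TPR <= exp(eps) * FPR + delta.

import math

def empirical_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 1e-5) -> float:
    """Lower bound on epsilon implied by an attack with the given TPR/FPR."""
    if fpr <= 0:
        return float("inf") if tpr > delta else 0.0
    return max(0.0, math.log((tpr - delta) / fpr)) if tpr > delta else 0.0

# Example: an attack that identifies a planted canary with TPR 0.60 at FPR 0.05
# certifies that the pipeline cannot satisfy (eps, 1e-5)-DP for eps below ~2.48.
print(empirical_epsilon_lower_bound(tpr=0.60, fpr=0.05))
```

In practice the observed rates are replaced by confidence-interval bounds (e.g. Clopper-Pearson) so that the reported epsilon holds with high probability rather than only on the observed sample.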
- A Survey on Differential Privacy with Machine Learning and Future Outlook [0.0]
Differential privacy is used to protect machine learning models from attacks and vulnerabilities.
This survey presents differentially private machine learning algorithms, categorized into two main categories.
arXiv Detail & Related papers (2022-11-19T14:20:53Z)
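As background for the kind of algorithms such a survey covers, the following is a minimal sketch of the core step of DP-SGD, the most widely used differentially private training method; the parameter values are illustrative, and a real implementation would also track the cumulative privacy budget with an accountant.

```python
# Minimal NumPy sketch of the core DP-SGD step (generic background, not code from the
# cited survey): clip each per-example gradient to bound its sensitivity, then add
# Gaussian noise to the sum before updating the parameters.

import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # bound each example's influence
    noisy_sum = np.sum(clipped, axis=0) + np.random.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)    # Gaussian mechanism
    return params - lr * noisy_sum / len(per_example_grads)

params = np.zeros(5)
grads = [np.random.randn(5) for _ in range(32)]   # stand-in per-example gradients
params = dp_sgd_step(params, grads)
print(params)
```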
- Applied Federated Learning: Architectural Design for Robust and Efficient Learning in Privacy Aware Settings [0.8454446648908585]
The classical machine learning paradigm requires the aggregation of user data in a central location.
Centralization of data poses risks, including a heightened risk of internal and external security incidents.
Federated learning with differential privacy is designed to avoid the server-side centralization pitfall.
arXiv Detail & Related papers (2022-06-02T00:30:04Z)
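The following sketch (toy model and data, not the cited paper's architecture) shows the federated averaging pattern this line of work builds on: raw data stays with the client, only locally computed model updates are sent, and the server can add noise to the aggregate for a differential privacy guarantee.

```python
# Minimal sketch of federated averaging (FedAvg) with optional noise on the aggregate.
# The local "training" is a toy one-step least-squares update; in a real deployment
# each local_update call runs on the client's device and only the update leaves it.

import numpy as np

def local_update(global_model, client_data, lr=0.1):
    # Toy local training: one gradient step of least squares on the client's own data.
    X, y = client_data
    grad = X.T @ (X @ global_model - y) / len(y)
    return global_model - lr * grad

def federated_round(global_model, clients, dp_noise_std=0.0):
    updates = [local_update(global_model, data) for data in clients]  # computed client-side
    aggregate = np.mean(updates, axis=0)
    if dp_noise_std > 0:
        aggregate += np.random.normal(scale=dp_noise_std, size=aggregate.shape)  # noisy aggregate
    return aggregate

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)]
model = np.zeros(3)
for _ in range(10):
    model = federated_round(model, clients, dp_noise_std=0.01)
print(model)
```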
- Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits.
In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop a unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)
- SPEED: Secure, PrivatE, and Efficient Deep learning [2.283665431721732]
We introduce a deep learning framework able to deal with strong privacy constraints.
Based on collaborative learning, differential privacy and homomorphic encryption, the proposed approach advances the state of the art.
arXiv Detail & Related papers (2020-06-16T19:31:52Z)
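To illustrate the homomorphic-encryption ingredient, here is a textbook toy of Paillier encryption with deliberately tiny, insecure parameters (not the SPEED framework's code): the additive homomorphism lets a server sum encrypted client contributions without decrypting any of them.

```python
# Toy, textbook Paillier encryption (tiny, insecure parameters; illustration only)
# showing the additively homomorphic property such approaches rely on: multiplying
# two ciphertexts yields an encryption of the sum of the plaintexts.

import math
import random

# Key generation with deliberately small primes -- never use parameters this small.
p, q = 293, 433
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lam mod n^2) mod n

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def add_encrypted(c1: int, c2: int) -> int:
    # Homomorphic addition: ciphertext product decrypts to the plaintext sum mod n.
    return (c1 * c2) % n2

# Two clients encrypt their (integer-quantized) contributions; the server adds them blindly.
c = add_encrypted(encrypt(1234), encrypt(5678))
assert decrypt(c) == 1234 + 5678
print(decrypt(c))
```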
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.