Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving
Training Data Release for Machine Learning
- URL: http://arxiv.org/abs/2307.01875v1
- Date: Tue, 4 Jul 2023 18:37:11 GMT
- Title: Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving
Training Data Release for Machine Learning
- Authors: Tamas Madl, Weijie Xu, Olivia Choudhury, Matthew Howard
- Abstract summary: We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
- Score: 3.29354893777827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The availability of large amounts of informative data is crucial for
successful machine learning. However, in domains with sensitive information,
the release of high-utility data which protects the privacy of individuals has
proven challenging. Despite progress in differential privacy and generative
modeling for privacy-preserving data release in the literature, only a few
approaches optimize for machine learning utility: most approaches only take
into account statistical metrics on the data itself and fail to explicitly
preserve the loss metrics of machine learning models that are to be
subsequently trained on the generated data. In this paper, we introduce a data
release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility
for machine learning, while preserving differential privacy. We also describe a
specific implementation of this framework that leverages mixture models to
approximate, kernel-inducing points to adapt, and Gaussian differential privacy
to anonymize a dataset, in order to ensure that the resulting data is both
privacy-preserving and high utility. We present experimental evidence showing
minimal discrepancy between performance metrics of models trained on real
versus privatized datasets, when evaluated on held-out real data. We also
compare our results with several privacy-preserving synthetic data generation
models (such as differentially private generative adversarial networks), and
report significant increases in classification performance metrics compared to
state-of-the-art models. These favorable comparisons show that the presented
framework is a promising direction of research, increasing the utility of
low-risk synthetic data release for machine learning.
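
To make the three stages more concrete, below is a minimal, illustrative sketch of an analogous pipeline, not the authors' implementation: a per-class Gaussian mixture model stands in for the "approximate" step, Gaussian noise on the released mixture means stands in for the "anonymize" step, and utility is checked in the spirit of the paper's evaluation by training on the synthetic data and testing on held-out real data. The kernel-inducing-point "adapt" step is omitted, the noise scale is an uncalibrated placeholder rather than a proper Gaussian-DP calibration, and all dataset, model, and parameter choices (scikit-learn GMM, logistic regression, etc.) are assumptions made here for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for a sensitive tabular dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "Approximate": fit a mixture model per class to summarise the real data.
# "Anonymize" (placeholder): perturb the released mixture means with Gaussian
# noise. Under mu-GDP, the Gaussian mechanism with L2-sensitivity Delta uses
# sigma = Delta / mu; here noise_scale is NOT calibrated and is for shape only.
noise_scale = 0.1
X_syn_parts, y_syn_parts = [], []
for label in np.unique(y_train):
    X_c = X_train[y_train == label]
    gmm = GaussianMixture(n_components=5, random_state=0).fit(X_c)
    gmm.means_ = gmm.means_ + rng.normal(0.0, noise_scale, size=gmm.means_.shape)
    X_s, _ = gmm.sample(n_samples=len(X_c))
    X_syn_parts.append(X_s)
    y_syn_parts.append(np.full(len(X_s), label))
X_syn = np.vstack(X_syn_parts)
y_syn = np.concatenate(y_syn_parts)

# Utility check: train on the privatized synthetic data, evaluate on held-out
# real data, and compare against a model trained directly on the real split.
clf_real = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_syn = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
auc_real = roc_auc_score(y_test, clf_real.predict_proba(X_test)[:, 1])
auc_syn = roc_auc_score(y_test, clf_syn.predict_proba(X_test)[:, 1])
print(f"AUC trained on real data:      {auc_real:.3f}")
print(f"AUC trained on synthetic data: {auc_syn:.3f}")

A small gap between the two AUC values would correspond to the "minimal discrepancy" the abstract reports; in the paper this comparison is made against real, privatized, and baseline synthetic datasets rather than this toy setup.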
Related papers
- Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z)
- Privacy-Preserving Debiasing using Data Augmentation and Machine Unlearning [3.049887057143419]
Data augmentation exposes machine learning models to privacy attacks, such as membership inference attacks.
We propose an effective combination of data augmentation and machine unlearning, which can reduce data bias while providing a provable defense against known attacks.
arXiv Detail & Related papers (2024-04-19T21:54:20Z)
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as the scarcity of instructional data and the risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z)
- Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders [3.7463972693041274]
It is often overlooked that generative models are prone to memorising many details of individual training records.
In this paper we explore an alternative approach for privately generating data that makes direct use of the inherent stochasticity in generative models.
arXiv Detail & Related papers (2023-04-22T07:24:56Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find that some methods perform better than others across the board.
We obtain promising findings for classification tasks when using synthetic data to train machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Differentially Private Synthetic Data: Applied Evaluations and Enhancements [4.749807065324706]
Differentially private data synthesis protects personal details from exposure.
We evaluate four differentially private generative adversarial networks for data synthesis.
We propose QUAIL, an ensemble-based modeling approach to generating synthetic data.
arXiv Detail & Related papers (2020-11-11T04:03:08Z)
- Privacy Enhancing Machine Learning via Removal of Unwanted Dependencies [21.97951347784442]
This paper studies new variants of supervised and adversarial learning methods that remove sensitive information from the data before it is sent out for a particular application.
The explored methods optimize privacy preserving feature mappings and predictive models simultaneously in an end-to-end fashion.
Experimental results on mobile sensing and face datasets demonstrate that our models can maintain the utility of predictive models while causing sensitive predictions to perform poorly.
arXiv Detail & Related papers (2020-07-30T19:55:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.