Does Training with Synthetic Data Truly Protect Privacy?
- URL: http://arxiv.org/abs/2502.12976v1
- Date: Tue, 18 Feb 2025 15:56:52 GMT
- Title: Does Training with Synthetic Data Truly Protect Privacy?
- Authors: Yunpeng Zhao, Jie Zhang
- Abstract summary: We explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models.
We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation.
- Score: 2.793318238046947
- License:
- Abstract: As synthetic data becomes increasingly popular in machine learning tasks, numerous methods--without formal differential privacy guarantees--use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.
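To make the abstract's call for "careful and rigorous evaluation" concrete, here is a minimal, hypothetical sketch (not the authors' code; the toy data generator, the noisy-copy "synthetic" set, and the loss-threshold attack are illustrative assumptions) of the kind of audit such work relies on: train a model only on synthetic data, then test whether a membership inference attack can still distinguish the original private records from fresh samples.

```python
# Minimal sketch of a membership-inference audit for a model trained on synthetic data.
# Everything here is a toy stand-in, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_data(n, d=20):
    """Hypothetical private distribution: linear labels with label noise."""
    X = rng.normal(size=(n, d))
    w = np.ones(d) / np.sqrt(d)
    y = (X @ w + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

# "Members" are the private records the synthetic data was derived from;
# "non-members" are fresh draws from the same distribution.
X_members, y_members = make_data(500)
X_nonmembers, y_nonmembers = make_data(500)

# Stand-in synthetic training set: noisy copies of the members -- exactly the
# memorization failure mode a rigorous audit should be able to catch.
X_syn = X_members + 0.1 * rng.normal(size=X_members.shape)
y_syn = y_members

model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

def per_example_loss(X, y):
    """Cross-entropy loss of the trained model on each example."""
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, 1.0))

# Loss-threshold attack: lower loss => more likely the record was "seen" via the synthetic data.
scores = np.concatenate([-per_example_loss(X_members, y_members),
                         -per_example_loss(X_nonmembers, y_nonmembers)])
labels = np.concatenate([np.ones(len(X_members)), np.zeros(len(X_nonmembers))])

# AUC near 0.5 indicates little leakage; well above 0.5 means training on the
# synthetic data did not protect the original records.
print("membership-inference AUC:", roc_auc_score(labels, scores))
```

In this toy setup the AUC lands well above 0.5 because the "synthetic" set is little more than noisy copies of the members; the paper's broader point is that the four paradigms it studies fall at very different points on this leakage spectrum, so an attack-based audit, not the mere use of synthetic data, is what supports any privacy claim.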
Related papers
- Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks.
We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z) - Activity Recognition on Avatar-Anonymized Datasets with Masked Differential Privacy [64.32494202656801]
Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence.
We present an anonymization pipeline that replaces sensitive human subjects in video datasets with synthetic avatars within context.
We also propose MaskDP to protect non-anonymized but privacy-sensitive background information.
arXiv Detail & Related papers (2024-10-22T15:22:53Z) - Privacy-Preserving Student Learning with Differentially Private Data-Free Distillation [35.37005050907983]
We present an effective teacher-student learning approach to train privacy-preserving deep learning models.
Massive amounts of synthetic data can be generated for model training without exposing private data.
A student is trained on the synthetic data with the supervision of private labels.
arXiv Detail & Related papers (2024-09-19T01:00:18Z) - FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as the scarcity of instructional data and the risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z) - Practical considerations on using private sampling for synthetic data [1.3654846342364308]
Differential privacy for synthetic data generation has received much attention due to its ability to preserve privacy while allowing the synthetic data to be used freely.
Private sampling is the first noise-free method to construct differentially private synthetic data with rigorous bounds for privacy and accuracy.
We provide an implementation of the private sampling algorithm and discuss the realism of its constraints in practical cases.
arXiv Detail & Related papers (2023-12-12T10:20:04Z) - SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z) - Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data.
arXiv Detail & Related papers (2023-05-17T12:23:38Z) - Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders [3.7463972693041274]
It is often overlooked that generative models are prone to memorising many details of individual training records.
In this paper we explore an alternative approach for privately generating data that makes direct use of the inherent stochasticity in generative models.
arXiv Detail & Related papers (2023-04-22T07:24:56Z) - Certified Data Removal in Sum-Product Networks [78.27542864367821]
Deleting the collected data is often insufficient to guarantee data privacy.
UnlearnSPN is an algorithm that removes the influence of single data points from a trained sum-product network.
arXiv Detail & Related papers (2022-10-04T08:22:37Z) - The Privacy Onion Effect: Memorization is Relative [76.46529413546725]
We show an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable exposes a new layer of previously-safe points to the same attack.
This suggests that privacy-enhancing technologies such as machine unlearning could actually harm the privacy of other users.
arXiv Detail & Related papers (2022-06-21T15:25:56Z) - Synthetic Data -- Anonymisation Groundhog Day [4.694549066382216]
We present the first quantitative evaluation of the privacy gain of synthetic data publishing.
We show that synthetic data either does not prevent inference attacks or does not retain data utility.
In contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict.
arXiv Detail & Related papers (2020-11-13T16:58:42Z)