Joint Selection: Adaptively Incorporating Public Information for Private
Synthetic Data
- URL: http://arxiv.org/abs/2403.07797v1
- Date: Tue, 12 Mar 2024 16:34:07 GMT
- Title: Joint Selection: Adaptively Incorporating Public Information for Private
Synthetic Data
- Authors: Miguel Fuentes, Brett Mullins, Ryan McKenna, Gerome Miklau, Daniel
Sheldon
- Abstract summary: We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data.
We show that jam-pgm is able to outperform both publicly-assisted and non-publicly-assisted synthetic data generation mechanisms even when the public data distribution is biased.
- Score: 13.56146208014469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanisms for generating differentially private synthetic data based on
marginals and graphical models have been successful in a wide range of
settings. However, one limitation of these methods is their inability to
incorporate public data. Initializing a data-generating model by pre-training
on public data has been shown to improve the quality of synthetic data, but this
technique is not applicable when the model structure is not determined a priori. We
develop the mechanism jam-pgm, which expands the adaptive measurements
framework to jointly select between measuring public data and private data.
This technique allows for public data to be included in a graphical-model-based
mechanism. We show that jam-pgm is able to outperform both publicly-assisted
and non-publicly-assisted synthetic data generation mechanisms even when the
public data distribution is biased.
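The select-measure-generate loop that jam-pgm extends can be illustrated with a minimal sketch. Everything below is a simplification for intuition, not the paper's mechanism: the two candidate queries, the plain argmax standing in for a privatized (exponential-mechanism-style) selection step, the fixed `sigma` and `bias_penalty` constants, and the rescaling update standing in for graphical-model inference are all illustrative assumptions. What it shows is the joint-selection idea: score each candidate measurement under both sources, where a private answer is noisy but unbiased and a public answer is exact but possibly biased, and take whichever promises the larger improvement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy domain: histograms over 8 cells. The public histogram is a biased
# version of the private one (invented numbers, purely for illustration).
private = np.array([30, 25, 20, 10, 8, 4, 2, 1], dtype=float)
public = np.array([20, 30, 15, 15, 8, 6, 4, 2], dtype=float)

candidates = [np.arange(0, 4), np.arange(4, 8)]  # two marginal "queries"
model = np.full(8, private.sum() / 8)            # start from a uniform estimate
sigma, bias_penalty = 2.0, 5.0                   # assumed noise scale and bias prior

for _ in range(2):
    # Score each (query, source) pair by expected error reduction:
    # private answers pay a noise penalty, public answers a bias penalty.
    scored = []
    for q in candidates:
        scored.append((abs(model[q].sum() - private[q].sum()) - sigma, q, "private"))
        scored.append((abs(model[q].sum() - public[q].sum()) - bias_penalty, q, "public"))
    _, q, source = max(scored, key=lambda t: t[0])

    if source == "private":
        answer = private[q].sum() + rng.normal(0, sigma)  # noisy, spends budget
    else:
        answer = public[q].sum()                          # exact, free, maybe biased

    # Crude stand-in for graphical-model inference: rescale toward the answer.
    model[q] *= answer / model[q].sum()
```

In the actual mechanism, the selection step itself must be privatized and the model update is performed by inference over a graphical model rather than direct rescaling.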
Related papers
- Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data [10.1687640711587]
This work introduces the notion of "surrogate" public data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata.
We automate the process of generating surrogate public data with large language models (LLMs).
In particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records.
arXiv Detail & Related papers (2025-04-19T17:55:10Z)
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation [9.819636361032256]
Differentially Private Synthetic Data Generation is a key enabler of private and secure data sharing.
Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data.
We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting.
arXiv Detail & Related papers (2025-04-15T08:59:03Z)
- Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs [20.774525687291167]
We propose a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale finetuning.
CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data.
To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram.
arXiv Detail & Related papers (2025-03-16T04:00:32Z)
- Differentially Private Random Feature Model [52.468511541184895]
We produce a differentially private random feature model for privacy-preserving kernel machines.
We show that our method preserves privacy and derive a generalization error bound for the method.
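As a rough illustration of a private random feature model, the sketch below trains kernel ridge regression on random Fourier features and privatizes the learned weights by output perturbation. The feature map is the standard Rahimi-Recht construction, but the sensitivity bound and Gaussian noise calibration here are placeholder assumptions for illustration, not the bound or guarantee derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def rff(X, W, b):
    """Random Fourier features approximating an RBF kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

n, d, D = 200, 3, 64
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                       # toy regression target

W = rng.normal(size=(d, D))               # RBF frequencies (unit bandwidth)
b = rng.uniform(0, 2 * np.pi, size=D)
Phi = rff(X, W, b)

lam = 1.0                                 # ridge regularization
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Output perturbation: add Gaussian noise to the learned weights.
# The sensitivity and sigma below are placeholder values, not a derived bound.
epsilon, delta = 1.0, 1e-5
sensitivity = 2 * np.sqrt(2.0 * D) / (n * lam)   # illustrative, assumes bounded data
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
theta_private = theta + rng.normal(0, sigma, size=D)

pred = rff(X[:5], W, b) @ theta_private   # private predictions on a few points
```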
arXiv Detail & Related papers (2024-12-06T05:31:08Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
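A toy rendering of the ensemble idea, with simple Gaussian fits on bootstrap resamples standing in for deep generative models (an illustrative assumption, not the paper's setup): fitting several generators instead of one lets downstream estimates reflect uncertainty over the generative-model parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
real = rng.normal(2.0, 1.0, size=300)    # toy "real" dataset

# Ensemble of K generators: each is a Gaussian fit to a bootstrap resample,
# approximating a posterior over generator parameters.
K = 10
params = []
for _ in range(K):
    boot = rng.choice(real, size=real.size, replace=True)
    params.append((boot.mean(), boot.std()))

# Draw synthetic data from each ensemble member and pool it.
synthetic = np.concatenate([rng.normal(m, s, size=100) for m, s in params])
ensemble_mean = synthetic.mean()
spread = np.std([m for m, _ in params])  # parameter uncertainty across members
```

A single generator would hide `spread` entirely; the ensemble exposes it, which is the point of approximating the posterior over model parameters.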
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
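The density-ratio idea behind a DOMIAS-style score can be sketched in a few lines. The 1-D Gaussian data, the hand-rolled kernel density estimator, and the chosen bandwidth are illustrative assumptions; the point is only that synthetic density divided by reference density spikes near records the generator has overfit.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kde(data, bandwidth=0.2):
    """Minimal 1-D Gaussian kernel density estimator."""
    def pdf(x):
        z = (np.asarray(x)[:, None] - data[None, :]) / bandwidth
        return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    return pdf

# Toy setup (illustrative, not the paper's experiments): the generator
# has overfit around training records near 0.0.
synthetic = rng.normal(0.0, 0.2, size=1000)   # generated samples, tightly clustered
reference = rng.normal(0.0, 1.0, size=1000)   # samples from the true population

p_syn = gaussian_kde(synthetic)
p_ref = gaussian_kde(reference)

def domias_score(x):
    # Density ratio: large values flag local overfitting -> likely member.
    return p_syn(x) / p_ref(x)

member_like = float(domias_score([0.0])[0])       # near the overfit region
non_member_like = float(domias_score([2.5])[0])   # typical non-member point
```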
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- PreFair: Privately Generating Justifiably Fair Synthetic Data [17.037575948075215]
PreFair is a system that allows for Differential Privacy (DP) fair synthetic data generation.
We adapt the notion of justifiable fairness to fit the synthetic data generation scenario.
arXiv Detail & Related papers (2022-12-20T15:01:54Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of their synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
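The likelihood-ratio weighting idea can be illustrated with a toy importance-sampling correction. Here the real and synthetic distributions are known Gaussians so the ratio is exact; in the paper's setting the ratio must be estimated under privacy constraints, so treat the closed-form `log_ratio` below as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy example (not the paper's setup): the synthetic distribution is a
# biased N(0.5, 1) estimate of the real N(0, 1) distribution.
synthetic = rng.normal(0.5, 1.0, size=20000)

def log_ratio(x):
    # Exact Gaussian log-likelihood ratio log p_real(x) - log p_syn(x);
    # in practice this would be a privatized estimate, not a closed form.
    return -0.5 * x**2 + 0.5 * (x - 0.5) ** 2

weights = np.exp(log_ratio(synthetic))
naive_mean = synthetic.mean()                        # biased estimate (~0.5)
corrected_mean = np.average(synthetic, weights=weights)  # bias-mitigated (~0.0)
```

The self-normalized weighted average recovers the real-data mean despite every sample coming from the biased synthetic distribution.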
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
- An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)
- Causally Constrained Data Synthesis for Private Data Release [36.80484740314504]
Releasing synthetic data that reflects certain statistical properties of the original data preserves the privacy of the original data.
Prior works utilize differentially private data release mechanisms to provide formal privacy guarantees.
We propose incorporating causal information into the training process to favorably modify the aforementioned trade-off.
arXiv Detail & Related papers (2021-05-27T13:46:57Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Incorporating Causal Graphical Prior Knowledge into Predictive Modeling via Simple Data Augmentation [92.96204497841032]
Causal graphs (CGs) are compact representations of the knowledge of the data generating processes behind the data distributions.
We propose a model-agnostic data augmentation method that allows us to exploit the prior knowledge of the conditional independence (CI) relations.
We experimentally show that the proposed method is effective in improving the prediction accuracy, especially in the small-data regime.
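A minimal sketch of augmentation from a conditional independence relation: assuming X is independent of Y given Z (the toy variables below are invented for illustration), permuting X within each Z-stratum produces new records consistent with the assumed data-generating process.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data generated so that X and Y depend only on Z,
# hence X is conditionally independent of Y given Z.
n = 12
Z = rng.integers(0, 2, size=n)
X = Z + rng.integers(0, 2, size=n)
Y = Z * 2 + rng.integers(0, 2, size=n)

augmented = []
for z in np.unique(Z):
    idx = np.where(Z == z)[0]
    # Under X indep. Y | Z, permuting X within a Z-stratum yields
    # valid new (X, Y, Z) samples from the same joint distribution.
    perm = rng.permutation(idx)
    for i, j in zip(idx, perm):
        augmented.append((int(X[j]), int(Y[i]), int(z)))
```

Because X is only shuffled within strata, every marginal and every (X, Z) or (Y, Z) pairing is preserved; only the spurious in-sample X-Y pairings change.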
arXiv Detail & Related papers (2021-02-27T06:13:59Z)
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.