Differentially Private Synthetic Data Using KD-Trees
- URL: http://arxiv.org/abs/2306.13211v1
- Date: Mon, 19 Jun 2023 17:08:32 GMT
- Title: Differentially Private Synthetic Data Using KD-Trees
- Authors: Eleonora Kreačić, Navid Nouri, Vamsi K. Potluru, Tucker Balch,
Manuela Veloso
- Abstract summary: We exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms.
We propose both data independent and data dependent algorithms for $\epsilon$-differentially private synthetic data generation.
We show empirical utility improvements over prior work, and discuss the performance of our algorithm on a downstream classification task on a real dataset.
- Score: 11.96971298978997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creation of a synthetic dataset that faithfully represents the data
distribution and simultaneously preserves privacy is a major research
challenge. Many space partitioning based approaches have emerged in recent
years for answering statistical queries in a differentially private manner.
However, for the synthetic data generation problem, recent research has mainly
focused on deep generative models. In contrast, we exploit space partitioning
techniques together with noise perturbation and thus achieve intuitive and
transparent algorithms. We propose both data independent and data dependent
algorithms for $\epsilon$-differentially private synthetic data generation
whose kernel density resembles that of the real dataset. Additionally, we
provide theoretical results on the utility-privacy trade-offs and show how our
data dependent approach overcomes the curse of dimensionality and leads to a
scalable algorithm. We show empirical utility improvements over the prior work,
and discuss the performance of our algorithm on a downstream classification task on
a real dataset.
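The abstract's core idea of combining space partitioning with noise perturbation can be illustrated with a minimal sketch: build a fixed, data-independent KD-style partition of $[0,1]^d$, add Laplace noise to each cell count (a disjoint histogram, so sensitivity is 1 and the release is $\epsilon$-DP), and resample synthetic points uniformly within each cell. This is an assumed toy construction for intuition, not the paper's exact algorithm; the function name and parameters are illustrative only.

```python
import numpy as np

def dp_synthetic_partition(data, epsilon, depth, seed=None):
    """Toy epsilon-DP synthetic data via a fixed KD-style partition.

    Assumes data lies in [0, 1]^d. Splits each axis in turn at the
    midpoint, `depth` times, yielding 2**depth equal-volume boxes.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape

    # Data-independent partition: cycle through axes, halving each box.
    cells = [(np.zeros(d), np.ones(d))]
    for level in range(depth):
        axis = level % d
        new_cells = []
        for lo, hi in cells:
            mid = (lo[axis] + hi[axis]) / 2
            left_hi = hi.copy();  left_hi[axis] = mid
            right_lo = lo.copy(); right_lo[axis] = mid
            new_cells.append((lo, left_hi))
            new_cells.append((right_lo, hi))
        cells = new_cells

    # Each record falls in exactly one cell, so perturbing every count
    # with Laplace(1/epsilon) noise satisfies epsilon-DP.
    synthetic = []
    for lo, hi in cells:
        inside = np.all((data >= lo) & (data < hi), axis=1)
        noisy = inside.sum() + rng.laplace(scale=1.0 / epsilon)
        m = max(int(round(noisy)), 0)
        # Resample m synthetic points uniformly within the cell.
        synthetic.append(rng.uniform(lo, hi, size=(m, d)))
    return np.vstack(synthetic) if synthetic else np.empty((0, d))
```

A data-dependent variant, as the abstract suggests, would instead choose split points from the (privatized) data so that dense regions are refined more, which is what makes the approach scale with dimension.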
Related papers
- Differentially Private Synthetic High-dimensional Tabular Stream [7.726042106665366]
We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time.
Our algorithm satisfies differential privacy for the entire input stream.
We show the utility of our method via experiments on real-world datasets.
arXiv Detail & Related papers (2024-08-31T01:31:59Z)
- Differentially Private Synthetic Data with Private Density Estimation [2.209921757303168]
We adopt the framework of differential privacy, and explore mechanisms for generating an entire dataset.
We build upon the work of Boedihardjo et al., which laid the foundations for a new optimization-based algorithm for generating private synthetic data.
arXiv Detail & Related papers (2024-05-06T14:06:12Z)
- An Algorithm for Streaming Differentially Private Data [7.726042106665366]
We derive an algorithm for differentially private synthetic streaming data generation, especially curated towards spatial datasets.
The utility of our algorithm is verified on both real-world and simulated datasets.
arXiv Detail & Related papers (2024-01-26T00:32:31Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- On the Inherent Privacy Properties of Discrete Denoising Diffusion Models [17.773335593043004]
We present the pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models.
Our framework elucidates the potential privacy leakage for each data point in a given training dataset.
Our bounds also show that training with $s$-sized data points leads to a surge in privacy leakage.
arXiv Detail & Related papers (2023-10-24T05:07:31Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Differentially Private Algorithms for Synthetic Power System Datasets [0.0]
Power systems research relies on the availability of real-world network datasets.
Data owners are hesitant to share data due to security and privacy risks.
We develop privacy-preserving algorithms for the synthetic generation of optimization and machine learning datasets.
arXiv Detail & Related papers (2023-03-20T13:38:58Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.