Statistical Theory of Differentially Private Marginal-based Data
Synthesis Algorithms
- URL: http://arxiv.org/abs/2301.08844v2
- Date: Wed, 25 Jan 2023 02:27:44 GMT
- Title: Statistical Theory of Differentially Private Marginal-based Data
Synthesis Algorithms
- Authors: Ximing Li, Chendi Wang, Guang Cheng
- Abstract summary: Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST).
Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature.
- Score: 30.330715718619874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Marginal-based methods achieve promising performance in the synthetic data
competition hosted by the National Institute of Standards and Technology
(NIST). To deal with high-dimensional data, the distribution of synthetic data
is represented by a probabilistic graphical model (e.g., a Bayesian network),
while the raw data distribution is approximated by a collection of
low-dimensional marginals. Differential privacy (DP) is guaranteed by
introducing random noise to each low-dimensional marginal distribution. Despite
their promising performance in practice, the statistical properties of
marginal-based methods are rarely studied in the literature. In this paper, we
study DP data synthesis algorithms based on Bayesian networks (BN) from a
statistical perspective. We establish a rigorous accuracy guarantee for
BN-based algorithms, where the errors are measured by the total variation (TV)
distance or the $L^2$ distance. Related to downstream machine learning tasks,
an upper bound for the utility error of the DP synthetic data is also derived.
To complete the picture, we establish a lower bound for TV accuracy that holds
for every $\epsilon$-DP synthetic data generator.
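To make the mechanism described above concrete, the sketch below is a minimal, assumption-laden illustration (a toy three-attribute dataset, two fixed pairwise marginals, and an even split of the privacy budget across them) of how a marginal-based method privatizes low-dimensional marginals with the Laplace mechanism and then samples synthetic records from a simple chain-structured Bayesian network. It is not the algorithm analyzed in the paper, where the network structure is typically learned privately (e.g., in PrivBayes-style methods); all names and parameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n records over three discrete attributes with small domains.
n, domains = 1000, [3, 4, 2]
data = np.column_stack([rng.integers(0, d, size=n) for d in domains])

epsilon = 1.0
# Two measured marginals; eps/2 each gives total epsilon by sequential composition.
eps_per_marginal = epsilon / 2

def noisy_marginal(cols, eps):
    # Empirical histogram over the selected columns. Adding or removing one
    # record changes at most one cell by 1, so the L1 sensitivity is 1 and
    # Laplace noise with scale 1/eps yields eps-DP for this marginal.
    dims = [domains[c] for c in cols]
    counts = np.zeros(dims)
    for row in data[:, cols]:
        counts[tuple(row)] += 1
    counts += rng.laplace(scale=1.0 / eps, size=counts.shape)
    counts = np.clip(counts, 0.0, None)  # post-processing: drop negative mass
    counts += 1e-12                      # guard against an all-zero table
    return counts / counts.sum()         # normalized noisy marginal

p01 = noisy_marginal([0, 1], eps_per_marginal)  # noisy joint of (X0, X1)
p12 = noisy_marginal([1, 2], eps_per_marginal)  # noisy joint of (X1, X2)

def sample_synthetic(m):
    # Sample from the chain X0 -> X1 -> X2 implied by the two noisy marginals:
    # draw (X0, X1) jointly from p01, then X2 | X1 from p12.
    flat = p01.ravel()
    idx = rng.choice(flat.size, size=m, p=flat / flat.sum())
    x0, x1 = np.unravel_index(idx, p01.shape)
    cond = (p12 + 1e-12) / (p12 + 1e-12).sum(axis=1, keepdims=True)
    x2 = np.array([rng.choice(domains[2], p=cond[v]) for v in x1])
    return np.column_stack([x0, x1, x2])

synthetic = sample_synthetic(n)
print(synthetic[:5])

Since the noise is injected only into the marginal counts, everything after that step (clipping, normalization, and sampling) is pure post-processing and does not consume additional privacy budget.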
Related papers
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Distributed Markov Chain Monte Carlo Sampling based on the Alternating
Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z) - Differentially Private Synthetic Data Using KD-Trees [11.96971298978997]
We exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms.
We propose both data-independent and data-dependent algorithms for $\epsilon$-differentially private synthetic data generation.
We show empirical utility improvements over the prior work, and discuss performance of our algorithm on a downstream classification task on a real dataset.
arXiv Detail & Related papers (2023-06-19T17:08:32Z) - Distributed Semi-Supervised Sparse Statistical Inference [6.685997976921953]
A debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters.
Traditional methods require computing a debiased estimator on every machine.
An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabeled data, is developed.
arXiv Detail & Related papers (2023-06-17T17:30:43Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Noise-Aware Statistical Inference with Differentially Private Synthetic
Data [0.0]
We show that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities.
We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation, and synthetic data generation.
We develop a novel noise-aware synthetic data generation algorithm NAPSU-MQ using the principle of maximum entropy.
arXiv Detail & Related papers (2022-05-28T16:59:46Z) - Diverse Sample Generation: Pushing the Limit of Data-free Quantization [85.95032037447454]
This paper presents a generic Diverse Sample Generation scheme for the generative data-free post-training quantization and quantization-aware training.
For large-scale image classification tasks, our DSG can consistently outperform existing data-free quantization methods.
arXiv Detail & Related papers (2021-09-01T07:06:44Z) - Learning while Respecting Privacy and Robustness to Distributional
Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z) - One Step to Efficient Synthetic Data [9.3000873953175]
A common approach to synthetic data is to sample from a fitted model.
We show that this approach results in a sample with inefficient estimators and whose joint distribution is inconsistent with the true distribution.
Motivated by this, we propose a general method of producing synthetic data.
arXiv Detail & Related papers (2020-06-03T17:12:11Z) - Distribution Approximation and Statistical Estimation Guarantees of
Generative Adversarial Networks [82.61546580149427]
Generative Adversarial Networks (GANs) have achieved great success in unsupervised learning.
This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions with densities in a Hölder space.
arXiv Detail & Related papers (2020-02-10T16:47:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.