Private Synthetic Data Generation in Small Memory
- URL: http://arxiv.org/abs/2412.09756v1
- Date: Thu, 12 Dec 2024 23:24:05 GMT
- Title: Private Synthetic Data Generation in Small Memory
- Authors: Rayne Holland, Seyit Camtepe, Chandra Thapa, Jason Xue
- Abstract summary: We propose a lightweight synthetic data generator that ensures differential privacy while being resource-efficient. $\textsf{PrivHP}$ generates private synthetic data that preserves the input stream's distribution. It can process a dataset of size $n$ in $\mathcal{O}((w+k)\log (\varepsilon n))$ space, $\mathcal{O}(\log (\varepsilon n))$ update time, and outputs a private synthetic data generator in $\mathcal{O}(k\log k\log (\varepsilon n))$ time.
- Score: 8.913413757749066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protecting sensitive information on data streams is a critical challenge for modern systems. Current approaches to privacy in data streams follow two strategies. The first transforms the stream into a private sequence, enabling the use of non-private analyses but incurring high memory costs. The second uses compact data structures to create private summaries but restricts flexibility to predefined queries. To address these limitations, we propose $\textsf{PrivHP}$, a lightweight synthetic data generator that ensures differential privacy while being resource-efficient. $\textsf{PrivHP}$ generates private synthetic data that preserves the input stream's distribution, allowing flexible downstream analyses without additional privacy costs. It leverages a hierarchical decomposition of the domain, pruning low-frequency subdomains while preserving high-frequency ones in a privacy-preserving manner. To achieve memory efficiency in streaming contexts, $\textsf{PrivHP}$ uses private sketches to estimate subdomain frequencies without accessing the full dataset. $\textsf{PrivHP}$ is parameterized by a privacy budget $\varepsilon$, a pruning parameter $k$ and the sketch width $w$. It can process a dataset of size $n$ in $\mathcal{O}((w+k)\log (\varepsilon n))$ space, $\mathcal{O}(\log (\varepsilon n))$ update time, and outputs a private synthetic data generator in $\mathcal{O}(k\log k\log (\varepsilon n))$ time. Prior methods require $\Omega(n)$ space and construction time. Our evaluation uses the expected 1-Wasserstein distance between the sampler and the empirical distribution. Compared to state-of-the-art methods, we demonstrate that the additional cost in utility is inversely proportional to $k$ and $w$. This represents the first meaningful trade-off between performance and utility for private synthetic data generation.
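To make the pipeline concrete, here is a minimal Python sketch of the ingredients the abstract describes: a dyadic (hierarchical) decomposition of the domain, hashed per-level counters standing in for the private sketches, Laplace noise, and top-$k$ pruning. Class and method names are our own, the noise accounting is schematic rather than a verified DP mechanism, and none of this is the authors' implementation:

```python
import heapq
import random

class PrivHPSketch:
    """Minimal illustration of the PrivHP ingredients (hypothetical code,
    not the authors' implementation): dyadic decomposition of [0, 1),
    hashed per-level counters as a stand-in for private sketches,
    Laplace noise, and top-k pruning."""

    def __init__(self, eps=1.0, k=16, w=256, depth=10, seed=0):
        self.eps, self.k, self.w, self.depth = eps, k, w, depth
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        # One hash row of width w per level; a real count-min sketch uses several rows.
        self.tables = [[0] * w for _ in range(depth)]

    def update(self, x):
        """O(depth) update: count the dyadic ancestor of x at every level."""
        for lvl in range(self.depth):
            node = int(x * (1 << (lvl + 1)))  # index of x's interval at this level
            self.tables[lvl][hash((self.salts[lvl], node)) % self.w] += 1

    def _noisy_count(self, lvl, node):
        """Sketch estimate plus Laplace(1/eps) noise (difference of exponentials)."""
        cell = hash((self.salts[lvl], node)) % self.w
        noise = random.expovariate(self.eps) - random.expovariate(self.eps)
        return self.tables[lvl][cell] + noise

    def build_generator(self):
        """Greedily expand the k subdomains with the largest noisy counts,
        then return a sampler over the pruned hierarchy."""
        frontier = [(-self._noisy_count(0, n), 0, n) for n in (0, 1)]
        heapq.heapify(frontier)
        leaves = []
        while frontier and len(leaves) + len(frontier) < self.k:
            neg, lvl, node = heapq.heappop(frontier)
            if lvl + 1 == self.depth:  # cannot split further
                leaves.append((-neg, lvl, node))
                continue
            for child in (2 * node, 2 * node + 1):
                heapq.heappush(frontier, (-self._noisy_count(lvl + 1, child), lvl + 1, child))
        leaves += [(-neg, lvl, node) for neg, lvl, node in frontier]
        total = sum(max(c, 0.0) for c, _, _ in leaves) or 1.0

        def sample():
            r, acc = random.random(), 0.0
            for count, lvl, node in leaves:
                acc += max(count, 0.0) / total
                if r <= acc:
                    width = 1.0 / (1 << (lvl + 1))
                    return (node + random.random()) * width  # uniform in the leaf
            return random.random()  # numerical fallback

        return sample
```

Feeding the structure a stream via `update` and then calling `build_generator` mirrors the paper's two phases: a small-space streaming pass followed by generator construction; here the fixed `depth` plays the role of the paper's $\log(\varepsilon n)$ term.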
Related papers
- Nonparametric MLE for Gaussian Location Mixtures: Certified Computation and Generic Behavior [28.71736321665378]
We study the nonparametric maximum likelihood estimator $\widehat{\pi}$ for Gaussian location mixtures in one dimension.
We provide an algorithm which for small enough $\varepsilon>0$ computes an $\varepsilon$-approximation of $\widehat{\pi}$ in Wasserstein distance in time.
We also show the distribution of $\widehat{\pi}$ conditioned to be $k$-atomic admits a density on the associated $(2k-1)$-dimensional parameter space.
arXiv Detail & Related papers (2025-03-26T03:36:36Z) - Optimized Tradeoffs for Private Prediction with Majority Ensembling [59.99331405291337]
We introduce the Data-dependent Randomized Response Majority (DaRRM) algorithm. DaRRM is parameterized by a data-dependent noise function $\gamma$, and enables efficient utility optimization over the class of all private algorithms. We show that DaRRM provably enjoys a privacy gain of a factor of 2 over common baselines, with fixed utility.
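As a toy illustration of this mechanism family (a constant $\gamma$ stands in for DaRRM's data-dependent noise function, and the names are ours):

```python
import random

def private_majority(votes, gamma):
    """Toy private majority over binary ensemble votes: with probability
    gamma output the true majority, otherwise output a fair coin flip.
    A constant gamma stands in for DaRRM's data-dependent noise function."""
    majority = int(sum(votes) > len(votes) / 2)
    return majority if random.random() < gamma else random.randint(0, 1)

# Example: 7 teacher models vote; gamma trades off privacy and utility.
print(private_majority([1, 1, 0, 1, 1, 0, 1], gamma=0.8))
```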
arXiv Detail & Related papers (2024-11-27T00:48:48Z) - Fast John Ellipsoid Computation with Differential Privacy Optimization [34.437362489150246]
We present the first differentially private algorithm for fast John ellipsoid computation.
Our method integrates noise perturbation with sketching and leverage score sampling to achieve both efficiency and privacy.
arXiv Detail & Related papers (2024-08-12T03:47:55Z) - Profile Reconstruction from Private Sketches [13.929335175122265]
Given a multiset of $n$ items from $\mathcal{D}$, the profile reconstruction problem is to estimate, for $t = 0, 1, \dots, n$, the fraction $\vec{f}[t]$ of items in $\mathcal{D}$ that appear exactly $t$ times.
We consider differentially private profile estimation in a distributed, space-constrained setting where we wish to maintain an updatable, private sketch of the multiset.
We show how to speed up their LP-based technique.
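For concreteness, the profile vector itself is easy to state in code; a minimal non-private reference follows (the paper's contribution is estimating it from a private, updatable sketch):

```python
from collections import Counter

def profile(items, universe_size):
    """Non-private reference: f[t] = fraction of the universe D whose
    items appear exactly t times in the multiset."""
    counts = Counter(items)
    n = len(items)
    f = [0.0] * (n + 1)
    f[0] = (universe_size - len(counts)) / universe_size  # items never seen
    for multiplicity, num_items in Counter(counts.values()).items():
        f[multiplicity] += num_items / universe_size
    return f

# Universe {a, b, c, d}; "a" appears twice, "b" once, "c"/"d" never.
print(profile(["a", "a", "b"], universe_size=4))  # [0.5, 0.25, 0.25, 0.0]
```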
arXiv Detail & Related papers (2024-06-03T09:51:28Z) - Online Differentially Private Synthetic Data Generation [10.177542186664503]
We develop an online algorithm that generates a differentially private synthetic dataset at each time $t$.
This algorithm achieves a near-optimal accuracy bound of $O(\log(t)\,t^{-1/d})$ for $d \geq 2$ and $O(\log^{4.5}(t)\,t^{-1})$ for $d = 1$ in the 1-Wasserstein distance.
arXiv Detail & Related papers (2024-02-12T19:21:14Z) - Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and total variation metric between $\hat{\eta}^{\pi}$ and $\eta^{\pi}$ are below $\epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $\eta^{\pi}$.
arXiv Detail & Related papers (2023-09-29T14:14:53Z) - Differentially Private Clustering in Data Streams [65.78882209673885]
We present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox.
Our framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
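The black-box pattern this framework describes looks roughly as follows; `dp_coreset` is a hypothetical placeholder for any offline DP coreset or clustering routine, and the accounting that makes the union of releases private is omitted:

```python
def streaming_dp_clustering(stream, dp_coreset, k, release_every=1000):
    """Schematic of the black-box pattern: buffer stream points and
    periodically hand them to an offline DP coreset/clustering routine.
    `dp_coreset` is a hypothetical stand-in; the paper's framework adds
    the machinery that keeps the union of all releases private."""
    buffer, releases = [], []
    for i, point in enumerate(stream, start=1):
        buffer.append(point)
        if i % release_every == 0:
            releases.append(dp_coreset(buffer, k))  # offline DP black box
    return releases
```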
arXiv Detail & Related papers (2023-07-14T16:11:22Z) - Private Isotonic Regression [54.32252900997422]
We consider the problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function.
We obtain a pure-DP algorithm that has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset.
We show that the bounds above are essentially the best that can be achieved.
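For readers unfamiliar with isotonic regression, the classic non-private special case on a totally ordered domain (the simplest poset) is the pool-adjacent-violators algorithm; a small sketch for least-squares loss:

```python
def isotonic_fit(y):
    """Pool Adjacent Violators: non-private least-squares isotonic
    regression on a totally ordered domain (the simplest poset).
    The paper's DP algorithm targets general posets and Lipschitz losses."""
    blocks = []  # each block stores [sum, count]; merge while means decrease
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

print(isotonic_fit([3.0, 1.0, 2.0, 5.0]))  # [2.0, 2.0, 2.0, 5.0]
```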
arXiv Detail & Related papers (2022-10-27T05:08:07Z) - Smooth Anonymity for Sparse Graphs [69.1048938123063]
Differential privacy has emerged as the gold standard of privacy; however, it runs into difficulties when it comes to sharing sparse datasets.
In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
arXiv Detail & Related papers (2022-07-13T17:09:25Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) \leq C \cdot \mathrm{OPT} + \epsilon$ with high probability.
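In standard notation for this setting (our rendering, not a quote from the paper), the squared-loss objective and the agnostic guarantee take the form

$$F(\mathbf{w}) \;=\; \mathbb{E}_{(\mathbf{x},y)\sim \mathcal{D}}\big[(\sigma(\mathbf{w}\cdot\mathbf{x}) - y)^2\big], \qquad F(\mathbf{w}) \leq C \cdot \mathrm{OPT} + \epsilon, \quad \mathrm{OPT} = \min_{\mathbf{w}'} F(\mathbf{w}'),$$

with $C$ a constant approximation factor and the labels corrupted by adversarial noise.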
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Private Convex Optimization via Exponential Mechanism [16.867534746193833]
We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ recovers both the known optimal empirical risk and population loss under $(\epsilon,\delta)$-DP.
We also show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $\nabla f_i(x)$ for DP-SCO, where $n$ is the number of samples/users and $d$ is the ambient dimension.
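Schematically, the regularized mechanism samples from a density of the following shape (our rendering; the scale parameters $k$ and $\mu$ are chosen from the privacy budget, and the exact calibration is in the paper):

$$x \;\sim\; \pi(x) \;\propto\; \exp\!\Big(-k\Big(F(x) + \frac{\mu}{2}\,\lVert x \rVert_2^2\Big)\Big).$$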
arXiv Detail & Related papers (2022-03-01T06:51:03Z) - Sampling from Log-Concave Distributions with Infinity-Distance
Guarantees and Applications to Differentially Private Optimization [33.38289436686841]
We present an algorithm that outputs a point from a distribution $O(\varepsilon)$-close to $\pi$ in infinity-distance.
We also present a "soft-$\pi$" version of the Dikin walk which may be of independent interest.
arXiv Detail & Related papers (2021-11-07T13:44:50Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Features (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
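A minimal sketch of the idea as summarized above (the exact distribution over $\{-1, 0, +1\}$ and the sign nonlinearity are our choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_random_features(X, m, sparsity=2 / 3):
    """Illustrative ternary random features: project with a matrix whose
    entries lie in {-1, 0, +1}, so the projection needs no multiplications
    (only additions/subtractions) and each entry fits in 2 bits."""
    d = X.shape[1]
    W = rng.choice([-1, 0, 1], size=(d, m),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])
    return np.sign(X @ W)  # a cheap ternary nonlinearity on top

X = rng.standard_normal((5, 16))
print(ternary_random_features(X, m=8).shape)  # (5, 8)
```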
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy
Amplification by Shuffling [49.43288037509783]
We show that random shuffling amplifies differential privacy guarantees of locally randomized data.
Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees.
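A toy rendering of the shuffle model this result analyzes (binary randomized response as the local randomizer; parameter names are ours):

```python
import math
import random

def local_randomize(bit, eps0):
    """eps0-DP binary randomized response, applied locally by each user."""
    p_keep = math.exp(eps0) / (math.exp(eps0) + 1)
    return bit if random.random() < p_keep else 1 - bit

def shuffle_and_collect(bits, eps0):
    """Shuffle model: users randomize locally, then a trusted shuffler
    permutes the reports, detaching each report from its sender. The
    amplification result says the shuffled output satisfies much stronger
    central DP than eps0 alone suggests."""
    reports = [local_randomize(b, eps0) for b in bits]
    random.shuffle(reports)
    return reports

print(sum(shuffle_and_collect([1] * 500 + [0] * 500, eps0=1.0)))
```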
arXiv Detail & Related papers (2020-12-23T17:07:26Z) - Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n, d, \delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z) - BUDS: Balancing Utility and Differential Privacy by Shuffling [3.618133010429131]
Balancing utility and differential privacy by shuffling, or BUDS, is an approach towards crowd-sourced statistical databases.
A new algorithm is proposed using one-hot encoding and iterative shuffling together with loss-estimation and risk-minimization techniques.
In empirical tests of balanced utility and privacy, BUDS achieves $\epsilon = 0.02$, which is a very promising result.
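A toy rendering of the encode-then-shuffle recipe named above (the loss-estimation and risk-minimization steps are omitted, and all names are ours):

```python
import random

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

def buds_style_shuffle(rows, categories, rounds=3):
    """Toy rendering of the recipe summarized above: one-hot encode
    attribute values, then iteratively shuffle the encoded reports so
    no report stays linked to its contributor."""
    encoded = [one_hot(r, categories) for r in rows]
    for _ in range(rounds):
        random.shuffle(encoded)
    return encoded

print(buds_style_shuffle(["a", "b", "a"], categories=["a", "b", "c"]))
```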
arXiv Detail & Related papers (2020-06-07T11:39:13Z) - On the Complexity of Minimizing Convex Finite Sums Without Using the
Indices of the Individual Functions [62.01594253618911]
We exploit the finite noise structure of finite sums to derive a matching $O(n^2)$ upper bound under the global oracle model.
Following a similar approach, we propose a novel adaptation of SVRG which is both compatible with oracles and achieves complexity bounds of $\tilde{O}\big((n^2 + n\sqrt{L/\mu})\log(1/\epsilon)\big)$ and $O(n\sqrt{L/\epsilon})$ for $\mu > 0$ and $\mu = 0$, respectively.
arXiv Detail & Related papers (2020-02-09T03:39:46Z)