Differentially Private Clustering in Data Streams
- URL: http://arxiv.org/abs/2307.07449v2
- Date: Mon, 8 Jan 2024 02:32:23 GMT
- Title: Differentially Private Clustering in Data Streams
- Authors: Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong
- Abstract summary: We present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox.
Our framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
- Score: 65.78882209673885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The streaming model is an abstraction of computing over massive data streams,
which is a popular way of dealing with large-scale modern data analysis. In
this model, there is a stream of data points, one after the other. A streaming
algorithm is only allowed one pass over the data stream, and the goal is to
perform some analysis during the stream while using as small space as possible.
Clustering problems (such as $k$-means and $k$-median) are fundamental
unsupervised machine learning primitives, and streaming clustering algorithms
have been extensively studied in the past. However, since data privacy becomes
a central concern in many real-world applications, non-private clustering
algorithms are not applicable in many scenarios.
In this work, we provide the first differentially private streaming
algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean
data points over a stream with length at most $T$ using $poly(k,d,\log(T))$
space to achieve a constant multiplicative error and a $poly(k,d,\log(T))$
additive error. In particular, we present a differentially private streaming
clustering framework which only requires an offline DP coreset or clustering
algorithm as a blackbox. By plugging in existing results from DP clustering
Ghazi, Kumar, Manurangsi 2020 and Kaplan, Stemmer 2018, we achieve (1) a
$(1+\gamma)$-multiplicative approximation with
$\tilde{O}_\gamma(poly(k,d,\log(T)))$ space for any $\gamma>0$, and the
additive error is $poly(k,d,\log(T))$ or (2) an $O(1)$-multiplicative
approximation with $\tilde{O}(k^{1.5} \cdot poly(d,\log(T)))$ space and
$poly(k,d,\log(T))$ additive error. In addition, our algorithmic framework is
also differentially private under the continual release setting, i.e., the
union of outputs of our algorithms at every timestamp is always differentially
private.
Related papers
- Online Differentially Private Synthetic Data Generation [10.177542186664503]
We develop an online algorithm that generates a differentially private synthetic dataset at each time $t$.
This algorithm achieves a near-optimal accuracy bound of $O(log(t)t-1/d)$ for $dgeq 2$ and $O(log4.5(t)t-1)$ for $d=1$ in the 1-Wasserstein distance.
arXiv Detail & Related papers (2024-02-12T19:21:14Z) - Near-Optimal Differentially Private k-Core Decomposition [2.859324824091086]
We show that an $eps$-edge differentially private algorithm for $k$-core decomposition outputs the core numbers with no multiplicative error and $O(textlog(n)/eps)$ additive error.
This improves upon previous work by a factor of 2 in the multiplicative error, while giving near-optimal additive error.
arXiv Detail & Related papers (2023-12-12T20:09:07Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(mathrmnnz(X) + nlog n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $smashwidetildeO(k4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $epsilon$-DP algorithm must exhibit $O(|V|2/ epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d3/2log n)cdot OPT + O(k d2 log2 n / epsilon2)$, where $epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Fair and Representative Subset Selection from Data Streams [4.53279507109072]
We consider the setting where data items in the stream belong to one of several disjoint groups.
We propose efficient algorithms for the fairness-aware variant of the streaming submodular problem.
arXiv Detail & Related papers (2020-10-09T07:49:13Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $frac1lambdaepsilon$, one can obtain streaming algorithms with spacely smaller than $frac1lambdaepsilon$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.