Related papers: Differentially Private Clustering in Data Streams

URL: http://arxiv.org/abs/2307.07449v2
Date: Mon, 8 Jan 2024 02:32:23 GMT
Title: Differentially Private Clustering in Data Streams
Authors: Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong
Abstract summary: We present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox. Our framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
Score: 65.78882209673885
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The streaming model is an abstraction of computing over massive data streams, a popular way of handling large-scale modern data analysis. In this model, data points arrive one after another. A streaming algorithm is allowed only one pass over the data stream, and the goal is to perform some analysis during the stream while using as little space as possible. Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been studied extensively. However, as data privacy has become a central concern in many real-world applications, non-private clustering algorithms are not applicable in many scenarios. In this work, we provide the first differentially private streaming algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$, using $\mathrm{poly}(k,d,\log(T))$ space to achieve a constant multiplicative error and a $\mathrm{poly}(k,d,\log(T))$ additive error. In particular, we present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox. By plugging in existing results from DP clustering (Ghazi, Kumar, Manurangsi 2020; Kaplan, Stemmer 2018), we achieve (1) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(\mathrm{poly}(k,d,\log(T)))$ space for any $\gamma>0$ and $\mathrm{poly}(k,d,\log(T))$ additive error, or (2) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot \mathrm{poly}(d,\log(T)))$ space and $\mathrm{poly}(k,d,\log(T))$ additive error. In addition, our algorithmic framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
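
Illustration (not from the paper): the abstract states its guarantee as a multiplicative error plus an additive error on the clustering cost. The minimal sketch below computes the $k$-means or $k$-median cost of a candidate set of centers and checks whether it is at most alpha times a reference cost plus beta. The function names, the toy points, and the values of alpha and beta are assumptions chosen for illustration only, and no differential privacy mechanism is implemented here.

```python
import math


def clustering_cost(points, centers, power):
    """Sum over all points of (distance to nearest center) ** power.

    power=2 gives the k-means cost, power=1 gives the k-median cost.
    """
    return sum(
        min(math.dist(p, c) for c in centers) ** power
        for p in points
    )


def within_guarantee(cost, reference_cost, alpha, beta):
    """Check cost <= alpha * reference_cost + beta.

    alpha is the multiplicative error and beta the additive error;
    both are hypothetical inputs here, not values from the paper.
    """
    return cost <= alpha * reference_cost + beta


# Toy 2-dimensional data with two well-separated groups (assumed example).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Centroids of the two groups, used here as the reference solution.
reference_centers = [(0.0, 1.0), (10.0, 1.0)]

# A slightly worse candidate, standing in for the output of some algorithm.
candidate_centers = [(0.0, 0.0), (10.0, 0.0)]

reference_cost = clustering_cost(points, reference_centers, power=2)
candidate_cost = clustering_cost(points, candidate_centers, power=2)

ok = within_guarantee(candidate_cost, reference_cost, alpha=1.5, beta=2.0)
print(reference_cost, candidate_cost, ok)
```

In the paper's setting the additive term is $\mathrm{poly}(k,d,\log(T))$ and the multiplicative term is either $(1+\gamma)$ or $O(1)$; the constants used above are placeholders and do not reflect those bounds.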

Related papers

Private Continual Counting of Unbounded Streams [11.941250828872189]
We study the problem of differentially private continual counting in the unbounded setting where the input size $n$ is not known in advance. Using the common "doubling trick" avoids knowledge of $n$ but leads to suboptimal and non-smooth error. We introduce novel matrix factorizations based on logarithmic perturbations of the function $\frac{1}{\sqrt{1-z}}$ studied in prior works.
arXiv Detail & Related papers (2025-06-17T23:09:53Z)
Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model [61.40326886123332]
We give the first sublinear space differentially private algorithms for the fundamental problem of counting distinct elements in the turnstile streaming model. Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which is linear. When a bound $W$ on the number of times an item appears in the stream is known, our algorithm provides $\tilde{O}_\eta(\sqrt{W})$ additive error using $\tilde{O}_\eta(\sqrt{W})$ space.
arXiv Detail & Related papers (2025-05-29T17:21:20Z)
Online Differentially Private Synthetic Data Generation [10.177542186664503]
We develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(\log(t)\, t^{-1/d})$ for $d \geq 2$ and $O(\log^{4.5}(t)\, t^{-1})$ for $d=1$ in the 1-Wasserstein distance.
arXiv Detail & Related papers (2024-02-12T19:21:14Z)
Near-Optimal Differentially Private k-Core Decomposition [2.859324824091086]
We show that an $\epsilon$-edge differentially private algorithm for $k$-core decomposition outputs the core numbers with no multiplicative error and $O(\log(n)/\epsilon)$ additive error. This improves upon previous work by a factor of 2 in the multiplicative error, while giving near-optimal additive error.
arXiv Detail & Related papers (2023-12-12T20:09:07Z)
Simple, Scalable and Effective Clustering via One-Dimensional Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z)
Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets. We present an improved version of the "$q$-means" algorithm for clustering. We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
Data Structures for Density Estimation [66.36971978162461]
Given a sublinear (in $n$) number of samples from $p$, our main result is the first data structure that identifies $v_i$ in time sublinear in $k$. We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z)
Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering. We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$-additive error for an input dataset. We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
Fair and Representative Subset Selection from Data Streams [4.53279507109072]
We consider the setting where data items in the stream belong to one of several disjoint groups. We propose efficient algorithms for the fairness-aware variant of the streaming submodular problem.
arXiv Detail & Related papers (2020-10-09T07:49:13Z)
Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model. We show that for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.