Differentially Private Clustering in Data Streams
- URL: http://arxiv.org/abs/2307.07449v3
- Date: Thu, 02 Oct 2025 13:35:31 GMT
- Title: Differentially Private Clustering in Data Streams
- Authors: Alessandro Epasto, Tamalika Mukherjee, Peilin Zhong
- Abstract summary: We provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream with length at most $T$. Our main technical contribution is a differentially private clustering framework for data streams which only requires an offline DP coreset or clustering algorithm as a blackbox.
- Score: 56.26040303056582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been extensively studied in the past. However, since data privacy becomes a central concern in many real-world applications, non-private clustering algorithms may not be as applicable in many scenarios. In this work, we provide the first differentially private algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream with length at most $T$ using space that is sublinear (in $T$) in the continual release setting where the algorithm is required to output a clustering at every timestep. We achieve (1) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot poly(d,\log(T)))$ space and $poly(k,d,\log(T))$ additive error, or (2) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(poly(k,2^{O_\gamma(d)},\log(T)))$ space for any $\gamma>0$, and the additive error is $poly(k,2^{O_\gamma(d)},\log(T))$. Our main technical contribution is a differentially private clustering framework for data streams which only requires an offline DP coreset or clustering algorithm as a blackbox.
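The blackbox reduction described in the abstract echoes the classical merge-and-reduce pattern for streaming coresets. Below is a minimal, non-private sketch of that pattern, where `offline_coreset` is a hypothetical stand-in (here plain reweighted subsampling) for the offline DP coreset algorithm the paper would plug in; the point is that only $O(\log T)$ coresets are ever held in memory.

```python
import random

def offline_coreset(points, size):
    """Hypothetical blackbox: the paper would use an offline DP coreset
    algorithm here; this illustration just subsamples weighted points
    uniformly and rescales weights so total weight is preserved."""
    if len(points) <= size:
        return list(points)
    total_w = sum(w for _, w in points)
    sample = random.sample(points, size)
    scale = total_w / sum(w for _, w in sample)
    return [(p, w * scale) for p, w in sample]

class MergeAndReduce:
    """Merge-and-reduce: level i summarizes about 2^i blocks of the
    stream; merging two same-level coresets pushes the result one level
    up (like a binary counter), so only O(log T) coresets are stored."""
    def __init__(self, block=8, coreset_size=8):
        self.block = block
        self.size = coreset_size
        self.buffer = []   # raw recent points as (point, weight)
        self.levels = {}   # level -> coreset (list of (point, weight))

    def insert(self, point):
        self.buffer.append((point, 1.0))
        if len(self.buffer) == self.block:
            carry = offline_coreset(self.buffer, self.size)
            self.buffer = []
            lvl = 0
            while lvl in self.levels:  # binary-counter style merging
                carry = offline_coreset(self.levels.pop(lvl) + carry, self.size)
                lvl += 1
            self.levels[lvl] = carry

    def coreset(self):
        """Union of all stored coresets plus the buffer: a weighted
        summary of the entire stream seen so far."""
        out = list(self.buffer)
        for cs in self.levels.values():
            out.extend(cs)
        return out
```

At every timestep the current `coreset()` can be fed to any offline clustering routine, which is the continual-release shape of the framework; the DP version additionally has to account for privacy loss across the $O(\log T)$ levels.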
Related papers
- Computational Hardness of Private Coreset [84.99100741615423]
For a given input set of points, a coreset is another set of points such that the $k$-means objective for any candidate solution is preserved up to a multiplicative $(1 \pm 1/n^{\Omega(1)})$ factor (and some additive factor). We show that no polynomial-time $(\epsilon, 1/n^{\Omega(1)})$-DP algorithm can compute a coreset for $k$-means in the $\ell_\infty$-metric for some constant $\epsilon > 0$ (and some constant additive factor).
arXiv Detail & Related papers (2026-02-19T15:58:49Z) - Incremental (k, z)-Clustering on Graphs [2.3322477552758234]
We develop a randomized incremental $(k, z)$-clustering algorithm that maintains with high probability a constant-factor approximation. In the first stage, we maintain a constant-factor bicriteria approximate solution of size $\tilde{O}(k)$ with a total update time of $m^{1+o(1)}$ over all adversarial edge insertions. In the second stage, we maintain a constant-factor approximate $(k,z)$-clustering solution on a dynamic weighted instance induced by the bicriteria approximate solution.
arXiv Detail & Related papers (2026-02-09T11:43:10Z) - A Broader View on Clustering under Cluster-Aware Norm Objectives [0.8016305126057338]
We revisit the $(f,g)$-clustering problem that we introduced in a recent work. This problem assigns each of the $k$ clusters a cost determined by the monotone, symmetric norm $f$ applied to the vector of distances in the cluster. We provide an $O(k)$-approximation for the general $(f,g)$-clustering problem.
arXiv Detail & Related papers (2025-12-05T23:14:10Z) - Private Continual Counting of Unbounded Streams [11.941250828872189]
We study the problem of differentially private continual counting in the unbounded setting where the input size $n$ is not known in advance. Using the common 'doubling trick' avoids knowledge of $n$ but leads to suboptimal and non-smooth error. We introduce novel matrix factorizations based on logarithmic perturbations of the function $\frac{1}{\sqrt{1-z}}$ studied in prior works.
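The 'doubling trick' this abstract contrasts against can be made concrete with a small sketch: run any bounded-horizon counter, and whenever the stream outlives the current guess, bank its total and restart with a doubled horizon. The `make_bounded` factory and its `update()`/`query()` interface are illustrative assumptions, not the paper's API; a noiseless counter is plugged in purely to show the bookkeeping.

```python
class DoublingCounter:
    """Continual counting for unbounded streams via the doubling trick:
    keep a bounded-horizon counter, and when the stream outlives the
    current horizon, bank its total and restart with a doubled horizon.
    Only O(log T) restarts ever happen over T steps."""
    def __init__(self, make_bounded):
        self.make_bounded = make_bounded  # factory: horizon -> bounded counter
        self.horizon = 2
        self.used = 0
        self.banked = 0.0                 # (noisy) totals of finished epochs
        self.cur = make_bounded(self.horizon)

    def update(self, x):
        if self.used == self.horizon:     # epoch exhausted: double and restart
            self.banked += self.cur.query()
            self.horizon *= 2
            self.used = 0
            self.cur = self.make_bounded(self.horizon)
        self.cur.update(x)
        self.used += 1

    def query(self):
        return self.banked + self.cur.query()

class ExactBounded:
    """Noiseless stand-in for a bounded-horizon DP counter (a real one
    would add, e.g., tree-aggregated Laplace noise for its horizon)."""
    def __init__(self, horizon):
        self.total = 0.0
    def update(self, x):
        self.total += x
    def query(self):
        return self.total
```

The error is non-smooth precisely because each restart re-pays the noise cost of a fresh epoch; the matrix-factorization approach of the paper avoids these discrete restarts altogether.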
arXiv Detail & Related papers (2025-06-17T23:09:53Z) - Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model [61.40326886123332]
We give the first sublinear-space differentially private algorithms for the fundamental problem of counting distinct elements in the turnstile streaming model. Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which are linear. When a bound $W$ on the number of times an item appears in the stream is known, our algorithm provides $\tilde{O}_\eta(\sqrt{W})$ additive error using $\tilde{O}_\eta(\sqrt{W})$ space.
arXiv Detail & Related papers (2025-05-29T17:21:20Z) - Online Differentially Private Synthetic Data Generation [10.177542186664503]
We develop an online algorithm that generates a differentially private synthetic dataset at each time $t$.
This algorithm achieves a near-optimal accuracy bound of $O(\log(t) \cdot t^{-1/d})$ for $d \geq 2$ and $O(\log^{4.5}(t) \cdot t^{-1})$ for $d = 1$ in the 1-Wasserstein distance.
arXiv Detail & Related papers (2024-02-12T19:21:14Z) - Near-Optimal Differentially Private k-Core Decomposition [2.859324824091086]
We show that an $\epsilon$-edge differentially private algorithm for $k$-core decomposition outputs the core numbers with no multiplicative error and $O(\log(n)/\epsilon)$ additive error.
This improves upon previous work by a factor of 2 in the multiplicative error, while giving near-optimal additive error.
arXiv Detail & Related papers (2023-12-12T20:09:07Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
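The one-dimensional-projection idea can be pictured with a small sketch: project the points onto one random direction, cut the sorted projections at the $k-1$ largest gaps, and average each resulting group as its center. The paper's algorithm is more refined; this is only the basic shape of the technique.

```python
import math
import random

def project_cluster(points, k, rng=random):
    """Sketch of clustering via a single random 1-D projection: project
    onto a random unit direction, split the sorted projections at the
    k-1 largest gaps, and return each group's mean as its center."""
    d = len(points[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]                      # random unit direction
    proj = sorted(range(len(points)),
                  key=lambda i: sum(points[i][j] * u[j] for j in range(d)))
    vals = [sum(points[i][j] * u[j] for j in range(d)) for i in proj]
    # indices of the k-1 largest gaps between consecutive projections
    gaps = sorted(range(1, len(vals)),
                  key=lambda g: vals[g] - vals[g - 1], reverse=True)[:k - 1]
    clusters, start = [], 0
    for cut in sorted(gaps) + [len(vals)]:
        clusters.append([points[i] for i in proj[start:cut]])
        start = cut
    centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) for cl in clusters]
    return centers, clusters
```

Because every cut separates two consecutive sorted projections, all $k$ groups are nonempty; the expected running time is dominated by the projection and the sort.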
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a 'dequantized' algorithm which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Data Structures for Density Estimation [66.36971978162461]
Given a sublinear (in $n$) number of samples from $p$, our main result is the first data structure that identifies $v_i$ in time sublinear in $k$.
We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z) - Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$ additive error for an input dataset.
We propose a private $(1+o(1))$-approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate; i.e., our bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
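The hierarchically-separated-tree approach rests on randomly shifted grids at geometrically finer scales: each point's sequence of cell ids is its root-to-leaf path in the tree, and two points' tree distance depends on the first level where their cells diverge. The function below is an illustrative reconstruction of that embedding step, not the paper's implementation.

```python
import random

def shifted_grid_paths(points, levels, side, rng=random):
    """Randomly-shifted grid hierarchy underlying HST embeddings: pick
    one uniform shift, then at level i bucket each point by its cell in
    a grid of side length side / 2^i.  A point's list of cell ids is
    its root-to-leaf path in the tree."""
    d = len(points[0])
    shift = [rng.uniform(0.0, side) for _ in range(d)]
    paths = []
    for p in points:
        path = []
        for i in range(levels):
            w = side / (2 ** i)  # cell width halves at each level
            path.append(tuple(int((p[j] + shift[j]) // w) for j in range(d)))
        paths.append(path)
    return paths
```

Clustering costs can then be approximated in the tree metric, which is what makes the method scalable even though the distortion shows up as the $O(d^{3/2}\log n)$ factor in the guarantee.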
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
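An explainable clustering of this kind is a threshold tree: each internal node compares one coordinate to a threshold, so every point's cluster label is justified by at most depth-many axis-aligned comparisons. A minimal sketch of the assignment step, using a hypothetical nested-tuple encoding of the tree:

```python
def assign_by_threshold_tree(point, tree):
    """Route a point through an axis-aligned threshold tree.
    `tree` is a hypothetical nested structure (not the paper's format):
    ('leaf', cluster_id) or ('split', feature, threshold, left, right),
    where points with point[feature] <= threshold go left."""
    node = tree
    while node[0] == 'split':
        _, feat, thr, left, right = node
        node = left if point[feat] <= thr else right
    return node[1]
```

The quality question the paper answers is how much $k$-means cost such trees must give up relative to an unconstrained clustering, as a function of $k$ and $d$.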
arXiv Detail & Related papers (2021-06-29T16:59:03Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Fair and Representative Subset Selection from Data Streams [4.53279507109072]
We consider the setting where data items in the stream belong to one of several disjoint groups.
We propose efficient algorithms for the fairness-aware variant of the streaming submodular problem.
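The fairness-aware selection setting can be illustrated with a simple streaming sketch: keep at most $k$ items while enforcing a per-group cap, greedily swapping out the lowest-value eligible item when a better one arrives. This greedy rule is only illustrative; the paper's algorithms handle submodular objectives with formal approximation guarantees.

```python
def fair_streaming_select(stream, k, caps, value):
    """Illustrative streaming selection under per-group caps: keep at
    most k items overall and at most caps[g] items from group g.
    stream yields (item, group) pairs; value scores a single item."""
    kept = []  # list of (item, group)
    for item, group in stream:
        in_group = [x for x in kept if x[1] == group]
        if len(in_group) < caps.get(group, k) and len(kept) < k:
            kept.append((item, group))
            continue
        # must swap: within the group if its cap binds, else anywhere
        pool = in_group if len(in_group) >= caps.get(group, k) else kept
        if not pool:
            continue
        worst = min(pool, key=lambda x: value(x[0]))
        if value(item) > value(worst[0]):
            kept.remove(worst)
            kept.append((item, group))
    return kept
```

Each arriving item is processed in one pass with memory $O(k)$, which is the streaming constraint the paper works under.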
arXiv Detail & Related papers (2020-10-09T07:49:13Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.