Online Coresets for Clustering with Bregman Divergences
- URL: http://arxiv.org/abs/2012.06522v1
- Date: Fri, 11 Dec 2020 17:39:21 GMT
- Title: Online Coresets for Clustering with Bregman Divergences
- Authors: Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, Supratim Shit
- Abstract summary: We present algorithms that create coresets in an online setting for clustering problems according to a wide subset of Bregman divergences.
Notably, our coresets have a small additive error, similar in magnitude to that of the lightweight coresets of Bachem et al. (2018).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present algorithms that create coresets in an online setting for
clustering problems according to a wide subset of Bregman divergences. Notably,
our coresets have a small additive error, similar in magnitude to the
lightweight coresets of Bachem et al. (2018), and take $O(d)$ update time for every
incoming point, where $d$ is the dimension of the point. Our first algorithm gives
online coresets of size $\tilde{O}(\mbox{poly}(k,d,\epsilon,\mu))$ for
$k$-clusterings according to any $\mu$-similar Bregman divergence. We further
extend this algorithm to show the existence of non-parametric coresets, where the
coreset size is independent of $k$, the number of clusters, for the same
subclass of Bregman divergences. Our non-parametric coresets are larger by a
factor of $O(\log n)$ (where $n$ is the number of points) and have a similar (small)
additive guarantee. At the same time, our coresets also function as lightweight
coresets for non-parametric versions of the Bregman clustering like DP-Means.
While these coresets provide additive error guarantees, they are also
significantly smaller (scaling with $O(\log n)$ as opposed to $O(d^d)$ for
points in $\mathbb{R}^d$) than the (relative-error) coresets obtained by Bachem et
al. (2015) for DP-Means. While our non-parametric coresets are existential, we
give an algorithmic version under certain assumptions.
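To make the online setting concrete, here is a minimal, hypothetical sketch, and explicitly not the paper's algorithm: it combines a lightweight-coreset-style importance score (a uniform term plus a distance-to-mean term, in the spirit of Bachem et al. 2018) with weighted reservoir sampling, so each arriving point is scored in $O(d)$ time and seen only once. Squared Euclidean distance stands in for a general $\mu$-similar Bregman divergence; the function name and the weight formula are illustrative assumptions.

```python
import heapq
import random
import numpy as np

def online_lightweight_coreset(stream, m, seed=0):
    """Hypothetical sketch (not the paper's method): maintain a weighted
    reservoir of m points, scoring each arriving point by a
    lightweight-coreset-style proxy q(x) built from a running mean.
    Per-point cost is O(d) for the score plus O(log m) heap work."""
    rng = random.Random(seed)
    heap = []          # min-heap of (key, arrival index, point, score)
    mean = None        # running mean of the stream so far
    n = 0
    total_sq = 0.0     # running sum of squared distances to the running mean
    for x in stream:
        x = np.asarray(x, dtype=float)
        n += 1
        if mean is None:
            mean = x.copy()
        else:
            mean += (x - mean) / n            # O(d) running-mean update
        d2 = float(np.dot(x - mean, x - mean))
        total_sq += d2
        # importance proxy: uniform term + distance term (always > 0)
        q = 0.5 / n + 0.5 * d2 / max(total_sq, 1e-12)
        # weighted-reservoir key (Efraimidis-Spirakis style): keep top-m keys
        key = rng.random() ** (1.0 / q)
        if len(heap) < m:
            heapq.heappush(heap, (key, n, x, q))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, n, x, q))
    # return coreset points with rough inverse-probability weights
    return [(x, 1.0 / (m * q)) for _, _, x, q in heap]
```

Note that the scores here are computed from evolving running statistics, so the inverse-probability weights are only heuristic; the paper's coresets come with actual additive-error guarantees that this toy sketch does not provide.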
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{kd} + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Differentially Private Clustering in Data Streams [65.78882209673885]
We present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox.
Our framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
arXiv Detail & Related papers (2023-07-14T16:11:22Z) - Universal Weak Coreset [3.1509756165776635]
A weak coreset is a pair $(J,S)$ of subsets of points, where $S$ acts as a summary of the point set and $J$ as a set of potential centers.
We develop this framework, which we call universal weak coresets, for constrained clustering settings.
arXiv Detail & Related papers (2023-05-26T12:51:16Z) - Coresets for Time Series Clustering [33.801060211529354]
We study the problem of constructing coresets for clustering problems with time series data.
Our main contribution is an algorithm to construct coresets for a mixture model.
We empirically assess the performance of our coreset with synthetic data.
arXiv Detail & Related papers (2021-10-28T16:21:13Z) - FriendlyCore: Practical Differentially Private Aggregation [67.04951703461657]
We propose a simple and practical tool $\mathsf{FriendlyCore}$ that takes a set of points $\mathcal{D}$ from an unrestricted (pseudo) metric space as input.
When $\mathcal{D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset $\mathcal{D}_G \subseteq \mathcal{D}$ that includes all points.
$\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy.
arXiv Detail & Related papers (2021-10-19T17:43:50Z) - Nearly-Tight and Oblivious Algorithms for Explainable Clustering [8.071379672971542]
We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020).
A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature).
We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective.
arXiv Detail & Related papers (2021-06-30T15:49:41Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.