Nearly-Tight and Oblivious Algorithms for Explainable Clustering
- URL: http://arxiv.org/abs/2106.16147v1
- Date: Wed, 30 Jun 2021 15:49:41 GMT
- Title: Nearly-Tight and Oblivious Algorithms for Explainable Clustering
- Authors: Buddhima Gamlath, Xinrui Jia, Adam Polak, Ola Svensson
- Abstract summary: We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020).
A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature).
We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective.
- Score: 8.071379672971542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of explainable clustering in the setting first
formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020). A
$k$-clustering is said to be explainable if it is given by a decision tree
where each internal node splits data points with a threshold cut in a single
dimension (feature), and each of the $k$ leaves corresponds to a cluster. We
give an algorithm that outputs an explainable clustering that loses at most a
factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable)
clustering for the $k$-medians objective, and a factor of $O(k \log^2 k)$ for
the $k$-means objective. This improves over the previous best upper bounds of
$O(k)$ and $O(k^2)$, respectively, and nearly matches the previous $\Omega(\log
k)$ lower bound for $k$-medians and our new $\Omega(k)$ lower bound for
$k$-means. The algorithm is remarkably simple. In particular, given an initial
not necessarily explainable clustering in $\mathbb{R}^d$, it is oblivious to
the data points and runs in time $O(dk \log^2 k)$, independent of the number of
data points $n$. Our upper and lower bounds also generalize to objectives given
by higher $\ell_p$-norms.
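To make the threshold-tree structure concrete, here is a minimal Python sketch: it builds a tree from the $k$ reference centers alone (hence oblivious to the $n$ data points) and then routes each point to a leaf. The cut at each node is drawn uniformly among axis-aligned cuts separating at least one pair of centers, which illustrates the structure but is not the paper's exact sampling distribution.

```python
import random

def build_threshold_tree(centers):
    """Recursively split a set of reference centers with axis-aligned
    threshold cuts until each leaf holds one center (= one cluster).
    Assumes the centers are distinct tuples of floats."""
    if len(centers) == 1:
        return {"center": centers[0]}
    d = len(centers[0])
    # Candidate cuts (dimension i, threshold t) separating some pair of centers.
    cuts = []
    for i in range(d):
        vals = sorted({c[i] for c in centers})
        cuts.extend((i, (lo + hi) / 2.0) for lo, hi in zip(vals, vals[1:]))
    i, t = random.choice(cuts)  # uniform choice, for illustration only
    left = [c for c in centers if c[i] <= t]
    right = [c for c in centers if c[i] > t]
    return {"dim": i, "threshold": t,
            "left": build_threshold_tree(left),
            "right": build_threshold_tree(right)}

def assign(tree, x):
    """Route a data point down the threshold tree to its cluster center."""
    while "center" not in tree:
        tree = tree["left"] if x[tree["dim"]] <= tree["threshold"] else tree["right"]
    return tree["center"]
```

Since the tree touches only the centers, assigning the data points is pure post-processing, which is exactly the obliviousness described in the abstract.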
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
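As a concrete reading of the definition, here is a small $O(n^2)$ Python check, assuming $\delta(x)$ is the radius of the smallest ball around $x$ containing $\lceil n/k \rceil$ points, as in Jung et al.:

```python
import math

def individually_fair(points, centers, dist):
    """Return True iff every point x has a center within delta(x), where
    delta(x) = radius of the smallest ball around x holding ceil(n/k)
    points (Jung et al.'s definition; an O(n^2) illustrative check)."""
    n, k = len(points), len(centers)
    m = math.ceil(n / k)
    for x in points:
        radii = sorted(dist(x, y) for y in points)  # includes dist(x, x) = 0
        delta_x = radii[m - 1]                      # m-th nearest, counting x
        if min(dist(x, c) for c in centers) > delta_x:
            return False
    return True
```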
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile
Streaming [48.18845814885398]
We develop new techniques to extend the applicability of sketching-based approaches to sparse dictionary learning and the Euclidean $k$-means clustering problems.
On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem.
On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering.
arXiv Detail & Related papers (2023-10-29T16:46:26Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
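The general recipe behind projection-based clustering can be sketched as follows; this toy version projects onto a single Gaussian direction and splits at the $k-1$ widest gaps, which conveys the idea but is not the paper's exact algorithm or guarantee:

```python
import random

def project_and_cluster(X, k):
    """Toy one-dimensional-projection clustering: project onto a random
    Gaussian direction, sort, and cut at the k-1 largest gaps.
    Illustrative only; not the paper's exact procedure."""
    d = len(X[0])
    g = [random.gauss(0.0, 1.0) for _ in range(d)]  # random direction
    proj = sorted((sum(gi * xi for gi, xi in zip(g, x)), idx)
                  for idx, x in enumerate(X))
    # Positions j of the k-1 widest gaps between consecutive projections.
    gaps = sorted(range(len(proj) - 1),
                  key=lambda j: proj[j + 1][0] - proj[j][0],
                  reverse=True)[:k - 1]
    boundaries = sorted(gaps)
    labels, cluster = [0] * len(X), 0
    for pos, (_, idx) in enumerate(proj):
        labels[idx] = cluster
        if boundaries and pos == boundaries[0]:
            cluster += 1
            boundaries.pop(0)
    return labels
```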
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Explainable k-means. Don't be greedy, plant bigger trees! [12.68470213641421]
We provide a new bi-criteria $\tilde{O}(\log^2 k)$ competitive algorithm for explainable $k$-means clustering.
Explainable $k$-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
arXiv Detail & Related papers (2021-11-04T23:15:17Z) - Near-optimal Algorithms for Explainable k-Medians and k-Means [12.68470213641421]
We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
Our goal is to find an optimal decision tree that partitions data into $k$ clusters and minimizes the $k$-medians or $k$-means objective.
arXiv Detail & Related papers (2021-07-02T02:07:12Z) - Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$ times the optimal $k$-means cost.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
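For orientation, list-decodable mean estimation is typically formalized as follows (a standard formulation; the paper's exact parameters and guarantees may differ):

```latex
% List-decodable mean estimation (standard formulation).
% Only an alpha-fraction of the input is guaranteed to be clean.
\textbf{Given:} $n$ points in $\mathbb{R}^d$, of which $\alpha n$
(for some $\alpha \le 1/2$) are i.i.d.\ samples from a distribution
with unknown mean $\mu$, while the remaining $(1-\alpha)n$ points are
adversarial.
\textbf{Goal:} output a short list $\hat\mu_1, \dots, \hat\mu_m$ with
$m = O(1/\alpha)$ such that $\min_i \|\hat\mu_i - \mu\|_2$ is small.
```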
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
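For context, the classical fuzzy (soft) $k$-means objective replaces hard assignments with membership weights $u_{ij}$; this is the textbook fuzzy $c$-means formulation, and the paper may use a variant:

```latex
% Fuzzy c-means objective: u_{ij} are soft memberships and m > 1 is a
% fuzziness parameter; letting m -> 1 recovers hard k-means.
\min_{u,\,c}\; \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m}\, \| x_i - c_j \|^2
\quad \text{s.t.} \quad \sum_{j=1}^{k} u_{ij} = 1,\;\; u_{ij} \ge 0 .
```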
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that, for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
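For reference, a bias-regularized SVM regularizes the bias term $b$ together with the weights $w$; the following is the standard formulation assumed here, which may differ in details from the paper's:

```latex
% Bias-regularized SVM: hinge loss with the bias b included in the
% regularizer (standard form; a hedged reconstruction).
\min_{w,\,b}\; \frac{\lambda}{2}\bigl(\|w\|_2^2 + b^2\bigr)
+ \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(\langle w, x_i\rangle + b)\bigr)
```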
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.