Simple, Scalable and Effective Clustering via One-Dimensional
Projections
- URL: http://arxiv.org/abs/2310.16752v1
- Date: Wed, 25 Oct 2023 16:37:45 GMT
- Title: Simple, Scalable and Effective Clustering via One-Dimensional
Projections
- Authors: Moses Charikar, Monika Henzinger, Lunjia Hu, Maxmilian V\"otsch, Erik
Waingarten
- Abstract summary: Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(mathrmnnz(X) + nlog n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $smashwidetildeO(k4)$ on any input dataset for the $k$-means objective.
- Score: 10.807367640692021
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Clustering is a fundamental problem in unsupervised machine learning with
many applications in data analysis. Popular clustering algorithms such as
Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering
$n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix
$X$) into $k$ clusters. In applications with moderate to large $k$, the
multiplicative $k$ factor can become very expensive. We introduce a simple
randomized clustering algorithm that provably runs in expected time
$O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the
total number of non-zero entries in the input dataset $X$, which is upper
bounded by $nd$ and can be significantly smaller for sparse datasets. We prove
that our algorithm achieves approximation ratio $\smash{\widetilde{O}(k^4)}$ on
any input dataset for the $k$-means objective. We also believe that our
theoretical analysis is of independent interest, as we show that the
approximation ratio of a $k$-means algorithm is approximately preserved under a
class of projections and that $k$-means++ seeding can be implemented in
expected $O(n \log n)$ time in one dimension. Finally, we show experimentally
that our clustering algorithm gives a new tradeoff between running time and
cluster quality compared to previous state-of-the-art methods for these tasks.
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Differentially Private Clustering in Data Streams [65.78882209673885]
We present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a blackbox.
Our framework is also differentially private under the continual release setting, i.e., the union of outputs of our algorithms at every timestamp is always differentially private.
arXiv Detail & Related papers (2023-07-14T16:11:22Z) - Near-Optimal Quantum Coreset Construction Algorithms for Clustering [15.513270929560088]
We give quantum algorithms that find coresets for $k$-clustering in $mathbbRd$ with $tildeO(sqrtnkd3/2)$ query complexity.
Our coreset reduces the input size from $n$ to $mathrmpoly(kepsilon-1d)$, so that existing $alpha$-approximation algorithms for clustering can run on top of it.
arXiv Detail & Related papers (2023-06-05T12:22:46Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e. our bound holds for $alpha$ up to $1/2$, an improvement over $alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Nearly-Tight and Oblivious Algorithms for Explainable Clustering [8.071379672971542]
We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020)
A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node data points with a threshold cut in a single dimension (feature)
We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(log2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective.
arXiv Detail & Related papers (2021-06-30T15:49:41Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.