Systematically improving existing k-means initialization algorithms at
nearly no cost, by pairwise-nearest-neighbor smoothing
- URL: http://arxiv.org/abs/2202.03949v1
- Date: Tue, 8 Feb 2022 15:56:30 GMT
- Title: Systematically improving existing k-means initialization algorithms at
nearly no cost, by pairwise-nearest-neighbor smoothing
- Authors: Carlo Baldassi
- Abstract summary: We present a meta-method for initializing the $k$-means clustering algorithm called PNN-smoothing.
It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method.
- Score: 1.2570180539670577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a meta-method for initializing (seeding) the $k$-means clustering
algorithm called PNN-smoothing. It consists in splitting a given dataset into
$J$ random subsets, clustering each of them individually, and merging the
resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a
meta-method in the sense that when clustering the individual subsets any
seeding algorithm can be used. If the computational complexity of that seeding
algorithm is linear in the size of the data $N$ and the number of clusters $k$,
PNN-smoothing is also almost linear with an appropriate choice of $J$, and in
fact only at most a few percent slower in most cases in practice. We show
empirically, using several existing seeding methods and testing on several
synthetic and real datasets, that this procedure results in systematically
better costs. It can even be applied recursively, and easily parallelized. Our
implementation is publicly available at
https://github.com/carlobaldassi/KMeansPNNSmoothing.jl
Related papers
- Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(mathrmnnz(X) + nlog n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $smashwidetildeO(k4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $varepsilon which runs in $Obig(frack2varepsilon2(sqrtkd + log(Nd))big.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d3/2log n)cdot OPT + O(k d2 log2 n / epsilon2)$, where $epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Learning Sparse Classifiers: Continuous and Mixed Integer Optimization
Perspectives [10.291482850329892]
Mixed integer programming (MIP) can be used to solve (to optimality) $ell_0$-regularized regression problems.
We propose two classes of scalable algorithms: an exact algorithm that can handlepapprox 50,000$ features in a few minutes, and approximate algorithms that can address instances with $papprox6$.
In addition, we present new estimation error bounds for $ell$-regularizeds.
arXiv Detail & Related papers (2020-01-17T18:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.