Consistency of Lloyd's Algorithm Under Perturbations
- URL: http://arxiv.org/abs/2309.00578v1
- Date: Fri, 1 Sep 2023 16:45:52 GMT
- Title: Consistency of Lloyd's Algorithm Under Perturbations
- Authors: Dhruv Patel and Hui Shen and Shankar Bhamidi and Yufeng Liu and Vladas Pipiras
- Abstract summary: Lloyd's algorithm is one of the most widely used clustering algorithms.
We show that the mis-clustering rate of Lloyd's algorithm on samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations.
- Score: 11.56433365648479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of unsupervised learning, Lloyd's algorithm is one of the most
widely used clustering algorithms. It has inspired a plethora of work
investigating the correctness of the algorithm under various settings with
ground truth clusters. In particular, in 2016, Lu and Zhou showed that the
mis-clustering rate of Lloyd's algorithm on $n$ independent samples from a
sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations,
assuming proper initialization of the algorithm. However, in many applications,
the true samples are unobserved and need to be learned from the data via
pre-processing pipelines such as spectral methods on appropriate data matrices.
We show that the mis-clustering rate of Lloyd's algorithm on perturbed samples
from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$
iterations under the assumptions of proper initialization and that the
perturbation is small relative to the sub-Gaussian noise. In canonical settings
with ground truth clusters, we derive bounds for algorithms such as
$k$-means$++$ to find good initializations, thus establishing the correctness
of clustering via the main result. We show the implications of the results for
pipelines that measure the statistical significance of derived clusters, such
as SigClust. We use these general results to provide theoretical guarantees on
the mis-clustering rate of Lloyd's
algorithm in a host of applications, including high-dimensional time series,
multi-dimensional scaling, and community detection for sparse networks via
spectral clustering.
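For concreteness, here is a minimal sketch of the pipeline the main result speaks to: $k$-means$++$ seeding followed by $O(\log n)$ Lloyd iterations, run on perturbed samples rather than the unobserved ones. It assumes numpy; the iteration count, perturbation scale, and empty-cluster handling are illustrative choices, not the paper's exact conditions.
```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: new centers drawn with probability proportional
    to squared distance from the current centers."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, k, n_iter=None, seed=0):
    """Lloyd's algorithm; O(log n) iterations, matching the regime of the theorem."""
    rng = np.random.default_rng(seed)
    n_iter = n_iter if n_iter is not None else int(np.ceil(3 * np.log(len(X))))
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Toy perturbed sub-Gaussian mixture: we only observe X_hat, not X.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
z = rng.integers(3, size=600)
X = means[z] + rng.normal(size=(600, 2))        # unobserved true samples
X_hat = X + 0.1 * rng.normal(size=X.shape)      # small perturbation from pre-processing
labels, _ = lloyd(X_hat, k=3)
```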
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
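A rough sketch of the ingredients (not the paper's algorithm): lazy mirror descent, here with an entropic mirror map over the simplex, paired with a robust aggregate of worker gradients; the coordinate-wise median and the `grads_per_worker` callback are our stand-in assumptions.
```python
import numpy as np

def robust_lazy_mirror_descent(grads_per_worker, dim, eta=0.1, T=100):
    # grads_per_worker(x, t) -> (n_workers, dim) array; some rows may be
    # adversarially corrupted. Hypothetical callback for this sketch.
    z = np.zeros(dim)                      # lazy: gradients accumulate in the dual
    x = np.full(dim, 1.0 / dim)            # start at the simplex center
    for t in range(T):
        g = np.median(grads_per_worker(x, t), axis=0)   # robust aggregate (stand-in)
        z -= eta * g
        w = np.exp(z - z.max())            # entropic mirror map back to the simplex
        x = w / w.sum()
    return x
```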
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- Heteroskedastic Tensor Clustering [20.979358557219953]
We propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$).
It starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to cluster the nodes.
Our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit.
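A greatly simplified sketch of the two-stage structure: a plain SVD of the mode-1 unfolding stands in for $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, and sklearn's $k$-means stands in for the approximate $k$-means step (both substitutions are ours).
```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_tensor_cluster(T, k, r):
    # Stage 1: estimate the mode-1 subspace from the mode-1 unfolding.
    n = T.shape[0]
    U, _, _ = np.linalg.svd(T.reshape(n, -1), full_matrices=False)
    embedding = U[:, :r]                   # one row per entity along mode 1
    # Stage 2: k-means on the embedding rows.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```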
arXiv Detail & Related papers (2023-11-04T02:50:40Z)
- A Modular Spatial Clustering Algorithm with Noise Specification [0.0]
The Bacteria-Farm algorithm is inspired by the growth of bacteria in closed experimental farms.
In contrast with other clustering algorithms, ours also has a provision to specify the amount of noise to be excluded during clustering.
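Bacteria-Farm itself is not reproduced here; the sketch below only illustrates the interface idea of a user-specified noise fraction, with plain $k$-means and a distance-quantile cutoff as our stand-ins.
```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_noise_fraction(X, k, noise_frac=0.05):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    d = km.transform(X).min(axis=1)              # distance to nearest center
    labels = km.labels_.copy()
    labels[d > np.quantile(d, 1.0 - noise_frac)] = -1   # exclude the farthest noise_frac
    return labels
```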
arXiv Detail & Related papers (2023-09-18T18:05:06Z)
- Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z)
- Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $\Omega(|V|^2/\epsilon)$-additive error on some input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
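As a toy illustration of one way privacy can enter such pipelines (input perturbation, not the paper's mechanism), assuming a symmetric similarity matrix with entries in $[0, 1]$ and per-edge sensitivity 1:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dp_hierarchical(W, eps, k, seed=0):
    # Privatize the similarity matrix with Laplace noise, then cluster as usual.
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    noise = np.triu(rng.laplace(scale=1.0 / eps, size=(n, n)), 1)
    D = np.clip(1.0 - (W + noise + noise.T), 0.0, None)   # noisy similarities -> distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```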
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
- Optimal Clustering by Lloyd Algorithm for Low-Rank Mixture Model [12.868722327487752]
We propose a low-rank mixture model (LrMM) to treat matrix-valued observations.
A computationally efficient clustering method is designed by integrating Lloyd's algorithm and low-rank approximation.
Our method outperforms others in the literature on real-world datasets.
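A sketch of that integration: standard Lloyd steps on matrix-valued observations, with each centroid truncated to rank $r$ after averaging; the initialization and iteration count are illustrative choices.
```python
import numpy as np

def lrmm_lloyd(X, k, r, n_iter=20, seed=0):
    # X: (n, p, q) array of matrix-valued observations.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None] - centers[None]) ** 2).sum(axis=(2, 3))   # squared Frobenius
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                U, s, Vt = np.linalg.svd(X[labels == j].mean(axis=0), full_matrices=False)
                centers[j] = (U[:, :r] * s[:r]) @ Vt[:r]           # rank-r truncation
    return labels, centers
```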
arXiv Detail & Related papers (2022-07-11T03:16:10Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model \textit{does not} exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
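For reference, the query-free soft objective can be minimized with the standard fuzzy $c$-means updates; the query-based algorithm of the paper is not reproduced here.
```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))      # soft memberships, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]      # membership-weighted centroids
        d2 = ((X[:, None] - C[None]) ** 2).sum(-1) + 1e-12
        U = d2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)           # renormalize memberships
    return U, C
```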
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Plug-And-Play Learned Gaussian-mixture Approximate Message Passing [71.74028918819046]
We propose a plug-and-play compressed sensing (CS) recovery algorithm suitable for any i.i.d. source prior.
Our algorithm builds upon Borgerding's learned AMP (LAMP), yet significantly improves it by adopting a universal denoising function within the algorithm.
Numerical evaluation shows that the L-GM-AMP algorithm achieves state-of-the-art performance without any knowledge of the source prior.
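The learned denoiser is the paper's contribution and is not reproduced here; the sketch below is generic AMP with a soft-thresholding denoiser standing in for the Gaussian-mixture one.
```python
import numpy as np

def amp(A, y, n_iter=30, theta=1.0):
    m, n = A.shape
    x, z = np.zeros(n), y.copy()
    for _ in range(n_iter):
        r = x + A.T @ z                              # pseudo-data
        tau = np.linalg.norm(z) / np.sqrt(m)         # effective noise estimate
        x = np.sign(r) * np.maximum(np.abs(r) - theta * tau, 0.0)   # denoise
        z = y - A @ x + (np.count_nonzero(x) / m) * z               # Onsager correction
    return x
```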
arXiv Detail & Related papers (2020-11-18T16:40:45Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- Randomized Spectral Clustering in Large-Scale Stochastic Block Models [13.366036495177923]
We study spectral clustering using randomized sketching algorithms from a statistical perspective.
Under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm.
A new R package called Rclust is developed and made available to the public.
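A numpy stand-in for the idea (Rclust itself is an R package): a randomized range finder approximates the leading eigenvectors of the adjacency matrix before the $k$-means step; the oversampling parameter is an illustrative choice.
```python
import numpy as np
from sklearn.cluster import KMeans

def randomized_spectral_clustering(A, k, oversample=10, seed=0):
    # Assumes A is a symmetric adjacency matrix.
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[0], k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                 # approximate range of A
    _, V = np.linalg.eigh(Q.T @ A @ Q)             # eigenvectors of the small projection
    U = Q @ V[:, -k:]                              # approximate top-k eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```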
arXiv Detail & Related papers (2020-01-20T04:15:25Z)