Consistency of Lloyd's Algorithm Under Perturbations
- URL: http://arxiv.org/abs/2309.00578v1
- Date: Fri, 1 Sep 2023 16:45:52 GMT
- Title: Consistency of Lloyd's Algorithm Under Perturbations
- Authors: Dhruv Patel and Hui Shen and Shankar Bhamidi and Yufeng Liu and Vladas Pipiras
- Abstract summary: Lloyd's algorithm is one of the most widely used clustering algorithms.
We show that the mis-clustering rate of Lloyd's algorithm on samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations.
- Score: 11.56433365648479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of unsupervised learning, Lloyd's algorithm is one of the most
widely used clustering algorithms. It has inspired a plethora of work
investigating the correctness of the algorithm under various settings with
ground truth clusters. In particular, in 2016, Lu and Zhou showed that the
mis-clustering rate of Lloyd's algorithm on $n$ independent samples from a
sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations,
assuming proper initialization of the algorithm. However, in many applications,
the true samples are unobserved and need to be learned from the data via
pre-processing pipelines such as spectral methods on appropriate data matrices.
We show that the mis-clustering rate of Lloyd's algorithm on perturbed samples
from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$
iterations under the assumptions of proper initialization and that the
perturbation is small relative to the sub-Gaussian noise. In canonical settings
with ground truth clusters, we derive bounds for algorithms such as
$k$-means$++$ to find good initializations, thus establishing the correctness
of clustering via the main result. We show the implications of the results for
pipelines that measure the statistical significance of derived clusters, such
as SigClust. We use these general results to provide theoretical guarantees on
the mis-clustering rate of Lloyd's
algorithm in a host of applications, including high-dimensional time series,
multi-dimensional scaling, and community detection for sparse networks via
spectral clustering.
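For concreteness, here is a minimal sketch of the pipeline the main result speaks to: $k$-means$++$ seeding followed by $O(\log n)$ Lloyd iterations, run on perturbed samples rather than the unobserved ones. It assumes numpy; the iteration count, perturbation scale, and empty-cluster handling are illustrative choices, not the paper's exact conditions.
```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: new centers drawn with probability proportional
    to squared distance from the current centers."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, k, n_iter=None, seed=0):
    """Lloyd's algorithm; O(log n) iterations, matching the regime of the theorem."""
    rng = np.random.default_rng(seed)
    n_iter = n_iter if n_iter is not None else int(np.ceil(3 * np.log(len(X))))
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Toy perturbed sub-Gaussian mixture: we only observe X_hat, not X.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
z = rng.integers(3, size=600)
X = means[z] + rng.normal(size=(600, 2))        # unobserved true samples
X_hat = X + 0.1 * rng.normal(size=X.shape)      # small perturbation from pre-processing
labels, _ = lloyd(X_hat, k=3)
```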
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
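A rough sketch of the ingredients (not the paper's algorithm): lazy mirror descent, here with an entropic mirror map over the simplex, paired with a robust aggregate of worker gradients; the coordinate-wise median and the `grads_per_worker` callback are our stand-in assumptions.
```python
import numpy as np

def robust_lazy_mirror_descent(grads_per_worker, dim, eta=0.1, T=100):
    # grads_per_worker(x, t) -> (n_workers, dim) array; some rows may be
    # adversarially corrupted. Hypothetical callback for this sketch.
    z = np.zeros(dim)                      # lazy: gradients accumulate in the dual
    x = np.full(dim, 1.0 / dim)            # start at the simplex center
    for t in range(T):
        g = np.median(grads_per_worker(x, t), axis=0)   # robust aggregate (stand-in)
        z -= eta * g
        w = np.exp(z - z.max())            # entropic mirror map back to the simplex
        x = w / w.sum()
    return x
```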
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- Heteroskedastic Tensor Clustering [20.979358557219953]
We propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$).
It starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to cluster the nodes.
Our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit.
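A greatly simplified sketch of the two-stage structure: a plain SVD of the mode-1 unfolding stands in for $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, and sklearn's $k$-means stands in for the approximate $k$-means step (both substitutions are ours).
```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_tensor_cluster(T, k, r):
    # Stage 1: estimate the mode-1 subspace from the mode-1 unfolding.
    n = T.shape[0]
    U, _, _ = np.linalg.svd(T.reshape(n, -1), full_matrices=False)
    embedding = U[:, :r]                   # one row per entity along mode 1
    # Stage 2: k-means on the embedding rows.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```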
arXiv Detail & Related papers (2023-11-04T02:50:40Z)
- A Modular Spatial Clustering Algorithm with Noise Specification [0.0]
The Bacteria-Farm algorithm is inspired by the growth of bacteria in closed experimental farms.
In contrast with other clustering algorithms, ours also has a provision to specify the amount of noise to be excluded during clustering.
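Bacteria-Farm itself is not reproduced here; the sketch below only illustrates the interface idea of a user-specified noise fraction, with plain $k$-means and a distance-quantile cutoff as our stand-ins.
```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_noise_fraction(X, k, noise_frac=0.05):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    d = km.transform(X).min(axis=1)              # distance to nearest center
    labels = km.labels_.copy()
    labels[d > np.quantile(d, 1.0 - noise_frac)] = -1   # exclude the farthest noise_frac
    return labels
```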
arXiv Detail & Related papers (2023-09-18T18:05:06Z)
- Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z)
- Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $\Omega(|V|^2/\epsilon)$-additive error on some input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
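As a toy illustration of one way privacy can enter such pipelines (input perturbation, not the paper's mechanism), assuming a symmetric similarity matrix with entries in $[0, 1]$ and per-edge sensitivity 1:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dp_hierarchical(W, eps, k, seed=0):
    # Privatize the similarity matrix with Laplace noise, then cluster as usual.
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    noise = np.triu(rng.laplace(scale=1.0 / eps, size=(n, n)), 1)
    D = np.clip(1.0 - (W + noise + noise.T), 0.0, None)   # noisy similarities -> distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```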
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
- Optimal Clustering by Lloyd Algorithm for Low-Rank Mixture Model [12.868722327487752]
We propose a low-rank mixture model (LrMM) to treat matrix-valued observations.
A computationally efficient clustering method is designed by integrating Lloyd's algorithm and low-rank approximation.
Our method outperforms others in the literature on real-world datasets.
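A sketch of that integration: standard Lloyd steps on matrix-valued observations, with each centroid truncated to rank $r$ after averaging; the initialization and iteration count are illustrative choices.
```python
import numpy as np

def lrmm_lloyd(X, k, r, n_iter=20, seed=0):
    # X: (n, p, q) array of matrix-valued observations.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None] - centers[None]) ** 2).sum(axis=(2, 3))   # squared Frobenius
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                U, s, Vt = np.linalg.svd(X[labels == j].mean(axis=0), full_matrices=False)
                centers[j] = (U[:, :r] * s[:r]) @ Vt[:r]           # rank-r truncation
    return labels, centers
```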
arXiv Detail & Related papers (2022-07-11T03:16:10Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model \textit{does not} exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
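For reference, the query-free soft objective can be minimized with the standard fuzzy $c$-means updates; the query-based algorithm of the paper is not reproduced here.
```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))      # soft memberships, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]      # membership-weighted centroids
        d2 = ((X[:, None] - C[None]) ** 2).sum(-1) + 1e-12
        U = d2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)           # renormalize memberships
    return U, C
```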
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Plug-And-Play Learned Gaussian-mixture Approximate Message Passing [71.74028918819046]
We propose a plug-and-play compressed sensing (CS) recovery algorithm suitable for any i.i.d. source prior.
Our algorithm builds upon Borgerding's learned AMP (LAMP), yet significantly improves it by adopting a universal denoising function within the algorithm.
Numerical evaluation shows that the L-GM-AMP algorithm achieves state-of-the-art performance without any knowledge of the source prior.
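The learned denoiser is the paper's contribution and is not reproduced here; the sketch below is generic AMP with a soft-thresholding denoiser standing in for the Gaussian-mixture one.
```python
import numpy as np

def amp(A, y, n_iter=30, theta=1.0):
    m, n = A.shape
    x, z = np.zeros(n), y.copy()
    for _ in range(n_iter):
        r = x + A.T @ z                              # pseudo-data
        tau = np.linalg.norm(z) / np.sqrt(m)         # effective noise estimate
        x = np.sign(r) * np.maximum(np.abs(r) - theta * tau, 0.0)   # denoise
        z = y - A @ x + (np.count_nonzero(x) / m) * z               # Onsager correction
    return x
```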
arXiv Detail & Related papers (2020-11-18T16:40:45Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- Randomized Spectral Clustering in Large-Scale Stochastic Block Models [13.366036495177923]
We study spectral clustering using randomized sketching algorithms from a statistical perspective.
Under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm.
A new R package called Rclust is developed and made available to the public.
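A numpy stand-in for the idea (Rclust itself is an R package): a randomized range finder approximates the leading eigenvectors of the adjacency matrix before the $k$-means step; the oversampling parameter is an illustrative choice.
```python
import numpy as np
from sklearn.cluster import KMeans

def randomized_spectral_clustering(A, k, oversample=10, seed=0):
    # Assumes A is a symmetric adjacency matrix.
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[0], k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                 # approximate range of A
    _, V = np.linalg.eigh(Q.T @ A @ Q)             # eigenvectors of the small projection
    U = Q @ V[:, -k:]                              # approximate top-k eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```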
arXiv Detail & Related papers (2020-01-20T04:15:25Z)