Structures of Spurious Local Minima in $k$-means
- URL: http://arxiv.org/abs/2002.06694v2
- Date: Fri, 21 Feb 2020 21:19:50 GMT
- Title: Structures of Spurious Local Minima in $k$-means
- Authors: Wei Qian, Yuqian Zhang, Yudong Chen
- Abstract summary: We investigate the structures of spurious local solutions to the $k$-means problem.
We prove that this is essentially the only type of spurious local minima under a separation condition.
- Score: 20.155509538529568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $k$-means clustering is a fundamental problem in unsupervised learning. The
problem concerns finding a partition of the data points into $k$ clusters such
that the within-cluster variation is minimized. Despite its importance and wide
applicability, a theoretical understanding of the $k$-means problem has not
been completely satisfactory. Existing algorithms with theoretical performance
guarantees often rely on sophisticated (sometimes artificial) algorithmic
techniques and restricted assumptions on the data. The main challenge lies in
the non-convex nature of the problem; in particular, there exist additional
local solutions other than the global optimum. Moreover, the simplest and most
popular algorithm for $k$-means, namely Lloyd's algorithm, generally converges
to such spurious local solutions both in theory and in practice.
In this paper, we approach the $k$-means problem from a new perspective, by
investigating the structures of these spurious local solutions under a
probabilistic generative model with $k$ ground truth clusters. As soon as
$k=3$, spurious local minima provably exist, even for well-separated and
balanced clusters. One such local minimum puts two centers at one true cluster,
and the third center in the middle of the other two true clusters. For general
$k$, one local minimum puts multiple centers at a true cluster, and one center
in the middle of multiple true clusters. Perhaps surprisingly, we prove that
this is essentially the only type of spurious local minima under a separation
condition. Our results pertain to the $k$-means formulation for mixtures of
Gaussians or bounded distributions. Our theoretical results corroborate
existing empirical observations and provide justification for several improved
algorithms for $k$-means clustering.
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $delta(x)$ of $x$ for each $xin P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$mathcalF_rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$mathcalF_rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - Are Easy Data Easy (for K-Means) [0.0]
This paper investigates the capability of correctly recovering well-separated clusters by various brands of the $k$-means algorithm.
A new algorithm is proposed that is a variation of $k$-means++ via repeated subsampling when choosing a seed.
arXiv Detail & Related papers (2023-08-02T09:40:19Z) - Global $k$-means$++$: an effective relaxation of the global $k$-means
clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the emphglobal $k$-meanstexttt++ clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z) - Asymptotics for The $k$-means [0.6091702876917281]
The $k$-means is one of the most important unsupervised learning techniques in statistics and computer science.
The proposed clustering consistency is more appropriate than the previous criterion consistency for the clustering methods.
It is found that the proposed $k$-means method has lower clustering error rates and is more robust to small clusters and outliers.
arXiv Detail & Related papers (2022-11-18T03:36:58Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy"
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.