Structures of Spurious Local Minima in $k$-means
- URL: http://arxiv.org/abs/2002.06694v2
- Date: Fri, 21 Feb 2020 21:19:50 GMT
- Title: Structures of Spurious Local Minima in $k$-means
- Authors: Wei Qian, Yuqian Zhang, Yudong Chen
- Abstract summary: We investigate the structures of spurious local solutions to the $k$-means problem.
We prove that this is essentially the only type of spurious local minima under a separation condition.
- Score: 20.155509538529568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $k$-means clustering is a fundamental problem in unsupervised learning. The
problem concerns finding a partition of the data points into $k$ clusters such
that the within-cluster variation is minimized. Despite its importance and wide
applicability, a theoretical understanding of the $k$-means problem has not
been completely satisfactory. Existing algorithms with theoretical performance
guarantees often rely on sophisticated (sometimes artificial) algorithmic
techniques and restricted assumptions on the data. The main challenge lies in
the non-convex nature of the problem; in particular, there exist additional
local solutions other than the global optimum. Moreover, the simplest and most
popular algorithm for $k$-means, namely Lloyd's algorithm, generally converges
to such spurious local solutions both in theory and in practice.
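The objective here is $\min_{c_1,\dots,c_k} \sum_{i=1}^{n} \min_{j \in [k]} \|x_i - c_j\|^2$. For reference, a minimal sketch of the Lloyd iteration (alternating nearest-center assignment and centroid updates), assuming NumPy; the function name and signature are illustrative, not the paper's code:

```python
# Minimal sketch of Lloyd's algorithm for k-means (illustrative only).
import numpy as np

def lloyd(X, centers, n_iters=100, tol=1e-8):
    """X: (n, d) data; centers: (k, d) initial centers. Returns (centers, labels)."""
    for _ in range(n_iters):
        # Assignment step: send each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            break  # centers have stabilized: a (possibly spurious) local solution
        centers = new_centers
    return centers, labels
```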
In this paper, we approach the $k$-means problem from a new perspective, by
investigating the structures of these spurious local solutions under a
probabilistic generative model with $k$ ground truth clusters. As soon as
$k=3$, spurious local minima provably exist, even for well-separated and
balanced clusters. One such local minimum puts two centers at one true cluster,
and the third center in the middle of the other two true clusters. For general
$k$, one local minimum puts multiple centers at a true cluster, and one center
in the middle of multiple true clusters. Perhaps surprisingly, we prove that
this is essentially the only type of spurious local minima under a separation
condition. Our results pertain to the $k$-means formulation for mixtures of
Gaussians or bounded distributions. Our theoretical results corroborate
existing empirical observations and provide justification for several improved
algorithms for $k$-means clustering.
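As a hedged numerical illustration of the $k=3$ configuration described above (two centers splitting one true cluster, one center midway between the other two), the following sketch reuses the `lloyd` function from the earlier block; the cluster locations, scales, and sample sizes are invented for illustration and do not come from the paper:

```python
# Illustration: Lloyd's algorithm stays at the spurious configuration
# described in the abstract. All numbers below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per cluster (balanced)
means = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])  # well separated
X = np.vstack([m + rng.normal(scale=0.5, size=(n, 2)) for m in means])

# Spurious initialization: two centers split the first true cluster,
# one center sits midway between the second and third true clusters.
bad_init = np.array([[-0.5, 0.0], [0.5, 0.0], [15.0, 0.0]])
centers, labels = lloyd(X, bad_init)  # `lloyd` as sketched earlier
print(centers)
# Expected: roughly [[-0.4, 0], [0.4, 0], [15, 0]] -- the spurious
# configuration persists rather than recovering the true centers 0, 10, 20.
```

From this initialization the assignment and update steps fix each other, so Lloyd's algorithm cannot escape; this matches the structure the paper proves is essentially the only kind of spurious local minimum under separation.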
Related papers
- Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems [8.74967598360817]
Given a dataset comprising several groups, the fairness constraint requires that each cluster contain a prescribed proportion of points from each group.
We propose a novel ``Relax and Merge'' framework, where $\rho$ is the approximation ratio of an off-the-shelf vanilla $k$-means algorithm.
If equipped with a PTAS for $k$-means, our solution achieves an approximation ratio of $(5+O(\epsilon))$ with only a slight violation of the fairness constraints.
arXiv Detail & Related papers (2024-11-02T02:50:12Z) - Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
While in some applications all data points can be chosen as centers, in the general setting centers must be chosen from a subset of points, referred to as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Are Easy Data Easy (for K-Means) [0.0]
This paper investigates the ability of various variants of the $k$-means algorithm to correctly recover well-separated clusters.
A new algorithm is proposed: a variant of $k$-means++ that repeatedly subsamples the data when choosing a seed.
arXiv Detail & Related papers (2023-08-02T09:40:19Z) - Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - Minimax Optimal Algorithms with Fixed-$k$-Nearest Neighbors [13.231906521852718]
We consider a distributed learning scenario in which a massive dataset is split into smaller groups.
We propose optimal rules to aggregate the fixed-$k$-NN information for classification, regression, and density estimation.
We show that the distributed algorithm with a fixed $k$ over a sufficiently large number of groups attains a minimax optimal error rate up to a multiplicative logarithmic factor.
arXiv Detail & Related papers (2022-02-05T01:59:09Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We give implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making a few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves; a toy sketch of the threshold-tree idea appears after this list.
arXiv Detail & Related papers (2020-02-28T04:21:53Z)
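On the last entry above, here is a toy sketch of the threshold-tree idea behind explainable $k$-means: greedy axis-aligned cuts that stop at $k$ leaves, so each cluster is described by a few coordinate thresholds. The greedy heuristics (widest coordinate, median cut) are invented placeholders, not the cut-selection rule of that paper:

```python
# Toy sketch of a threshold tree with k leaves for explainable clustering.
# The heuristics below (widest coordinate, median cut) are illustrative
# placeholders, not the algorithm from the cited paper.
import numpy as np

def threshold_tree_partition(X, k):
    """Split X into k axis-aligned cells, returned as boolean masks over rows."""
    cells = [np.ones(len(X), dtype=bool)]
    while len(cells) < k:
        # Pick the cell with the largest single-coordinate spread.
        best = max(range(len(cells)),
                   key=lambda i: np.ptp(X[cells[i]], axis=0).max())
        mask = cells.pop(best)
        dim = np.ptp(X[mask], axis=0).argmax()  # widest coordinate
        cut = np.median(X[mask][:, dim])        # median threshold
        cells.append(mask & (X[:, dim] <= cut))
        cells.append(mask & (X[:, dim] > cut))
    return cells
```

Each resulting cell is defined by the thresholds on the path to its leaf, which is what makes the clustering human-readable; a real implementation would choose cuts relative to a reference $k$-means solution and guard against empty or degenerate splits.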