Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming
- URL: http://arxiv.org/abs/2305.18436v5
- Date: Sat, 13 Apr 2024 04:05:41 GMT
- Title: Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming
- Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang, Richard Y. Zhang,
- Abstract summary: $K$-means clustering is a widely used machine learning method for identifying patterns in large datasets.
In this paper, we consider an NMF-like algorithm that solves nonnegative low-rank $K$-means factorization problem.
Our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-the-art while maintaining scalability.
- Score: 25.210724274471914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the $K$-means optimization problem, which enjoy strong statistical optimality guarantees. However, the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. In contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm widely used by machine learning practitioners, but it lacks a solid statistical underpinning and theoretical guarantees. In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is as simple and scalable as state-of-the-art NMF algorithms while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-of-the-art while maintaining scalability.
Related papers
- Tailed Low-Rank Matrix Factorization for Similarity Matrix Completion [14.542166904874147]
Similarity Completion Matrix serves as a fundamental tool at the core of numerous machinelearning tasks.
To address this issue, Similarity Matrix Theoretical (SMC) methods have been proposed, but they suffer complexity.
We present two novel, scalable, and effective algorithms, which investigate the PSD property to guide the estimation process and incorporate non low-rank regularizer to ensure the low-rank solution.
arXiv Detail & Related papers (2024-09-29T04:27:23Z) - Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement
Learning [53.445068584013896]
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure.
In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP.
We show that simple spectral-based matrix estimation approaches efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
arXiv Detail & Related papers (2023-10-10T17:06:41Z) - Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z) - Adversarially robust clustering with optimality guarantees [7.0830450321883935]
We consider the problem of clustering data points coming from sub-Gaussian mixtures.
Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers.
We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when we allow adversarial outliers to be present.
arXiv Detail & Related papers (2023-06-16T17:17:07Z) - Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous
Data [16.153709556346417]
Clustering is a widely deployed learning tool.
iLA-SDP is less sensitive than EM to and more stable on high-dimensional data.
arXiv Detail & Related papers (2022-09-29T21:03:13Z) - Large-scale Optimization of Partial AUC in a Range of False Positive
Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means
Clustering [16.153709556346417]
Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering.
We introduce a linear time complexity algorithm for approximating an SDP relaxed $K$-means clustering.
Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms.
arXiv Detail & Related papers (2022-01-20T15:31:28Z) - Self-supervised Symmetric Nonnegative Matrix Factorization [82.59905231819685]
Symmetric nonnegative factor matrix (SNMF) has demonstrated to be a powerful method for data clustering.
Inspired by ensemble clustering that aims to seek better clustering results, we propose self-supervised SNMF (S$3$NMF)
We take advantage of the sensitivity to code characteristic of SNMF, without relying on any additional information.
arXiv Detail & Related papers (2021-03-02T12:47:40Z) - Efficient semidefinite-programming-based inference for binary and
multi-class MRFs [83.09715052229782]
We propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF.
We extend semidefinite relaxations from the typical binary MRF to the full multi-class setting, and develop a compact semidefinite relaxation that can again be solved efficiently using the solver.
arXiv Detail & Related papers (2020-12-04T15:36:29Z) - Kernel k-Means, By All Means: Algorithms and Strong Consistency [21.013169939337583]
Kernel $k$ clustering is a powerful tool for unsupervised learning of non-linear data.
In this paper, we generalize results leveraging a general family of means to combat sub-optimal local solutions.
Our algorithm makes use of majorization-minimization (MM) to better solve this non-linear separation problem.
arXiv Detail & Related papers (2020-11-12T16:07:18Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.