Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering
- URL: http://arxiv.org/abs/2201.08226v1
- Date: Thu, 20 Jan 2022 15:31:28 GMT
- Title: Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering
- Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang
- Abstract summary: Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering.
We introduce a linear-time algorithm for approximating an SDP-relaxed $K$-means clustering.
Our simulation experiments demonstrate that the proposed method outperforms state-of-the-art fast clustering algorithms in statistical accuracy.
- Score: 16.153709556346417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semidefinite programming (SDP) is a powerful tool for tackling a wide range
of computationally hard problems such as clustering. Despite their high accuracy, semidefinite programs are often too slow in practice, with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear-time algorithm for approximating an SDP-relaxed $K$-means clustering. The
proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset
and then propagates the solution to all data points by a nearest-centroid
rounding procedure. It is shown that the SL approach enjoys an exact recovery threshold similar to that of the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive, with enhanced theoretical properties, when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the proposed method outperforms state-of-the-art fast clustering algorithms in statistical accuracy without sacrificing too much computational efficiency, and that it is comparable to the original $K$-means SDP with substantially reduced runtime.
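For intuition, here is a minimal sketch of the sketch-and-lift idea, assuming the Peng-Wei SDP relaxation is solved on a uniform subsample with cvxpy and the solution is lifted by nearest-centroid assignment; the subsample size, solver, and simple spectral rounding step are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

def sketch_and_lift(X, K, m=200, seed=0):
    """Cluster the rows of X into K groups: K-means SDP on a subsample, then a
    nearest-centroid lift to all points (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(m, n), replace=False)   # "sketch": uniform subsample
    S = X[idx]

    # Peng-Wei K-means SDP on the subsample: minimize <D, Z> over the SDP feasible set.
    D = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    Z = cp.Variable((len(idx), len(idx)), PSD=True)
    constraints = [Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == K]
    cp.Problem(cp.Minimize(cp.sum(cp.multiply(D, Z))), constraints).solve(solver=cp.SCS)

    # Round the SDP solution to subsample labels via its top-K eigenvectors.
    _, vecs = np.linalg.eigh(Z.value)
    labels_sub = KMeans(n_clusters=K, n_init=10).fit_predict(vecs[:, -K:])

    # "Lift": centroids from the subsample clusters, then nearest-centroid assignment
    # (assumes every cluster is represented in the subsample).
    centroids = np.vstack([S[labels_sub == c].mean(axis=0) for c in range(K)])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```

Only the $m \times m$ SDP depends on the subsample size, so the expensive step stays small while the final assignment pass is linear in $n$.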
Related papers
- Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming [25.210724274471914]
$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets.
In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank $K$-means factorization problem.
Our algorithm achieves significantly smaller mis-clustering errors compared to the existing state of the art while maintaining scalability.
arXiv Detail & Related papers (2023-05-29T00:39:55Z)
- Sketch-and-solve approaches to k-means clustering by semidefinite programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated, we identify the optimal k-means clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
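For reference, the Peng-Wei relaxation being sketched can be written in its standard form, with $D_{ij} = \|x_i - x_j\|^2$ the matrix of squared pairwise distances:

$$\min_{Z \in \mathbb{R}^{n \times n}} \ \langle D, Z \rangle \quad \text{s.t.} \quad Z \succeq 0, \; Z \geq 0, \; Z\mathbf{1} = \mathbf{1}, \; \operatorname{tr}(Z) = k,$$

and a clustering is read off by rounding the (approximately) block-diagonal optimizer $Z$.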
arXiv Detail & Related papers (2022-11-28T19:51:30Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
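As a rough illustration of the kind of gradient sketching such a communication cost suggests, here is a generic count-sketch compressor in Python; the hash construction and single-row decoder are assumptions for illustration, not the paper's exact SketchedAMSGrad operator.

```python
import numpy as np

def count_sketch(vec, k, seed=0):
    """Compress a d-dimensional vector into k buckets with random hashes and signs."""
    rng = np.random.default_rng(seed)
    d = vec.size
    buckets = rng.integers(0, k, size=d)        # hash each coordinate to a bucket
    signs = rng.choice([-1.0, 1.0], size=d)     # independent random sign flips
    sketch = np.zeros(k)
    np.add.at(sketch, buckets, signs * vec)     # accumulate signed values per bucket
    return sketch, buckets, signs

def unsketch(sketch, buckets, signs):
    """Coordinate-wise unbiased estimate of the original vector from one sketch row."""
    return signs * sketch[buckets]

grad = np.random.randn(10_000)                  # a toy "gradient" to compress
sk, bk, sg = count_sketch(grad, k=256)          # communicate O(k) numbers instead of d
approx = unsketch(sk, bk, sg)                   # noisy reconstruction at the receiver
```

With shared random seeds the hashes and signs need not be transmitted, which is what keeps the per-iteration communication on the order of the sketch size.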
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data [16.153709556346417]
Clustering is a widely deployed learning tool.
iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data.
arXiv Detail & Related papers (2022-09-29T21:03:13Z)
- Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning.
Compared to the state-of-the-art scheme, the proposed approach yields a system with drastically lower complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z)
- Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap [24.073221004661427]
A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model.
We provide an exact characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity.
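A generic form of such a model, with scaling conventions and sparsity parametrization that are illustrative and may differ from the paper's, is

$$x_i = \mu_{z_i} + \xi_i, \qquad \xi_i \sim \mathcal{N}(0, I_d), \qquad z_i \in \{1, \dots, k\},$$

where the cluster means $\mu_1, \dots, \mu_k \in \mathbb{R}^d$ are supported on a common small set of coordinates, so clustering requires recovering both the labels $z_i$ and the shared sparse support.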
arXiv Detail & Related papers (2022-05-26T17:47:35Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
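A minimal sketch of a one-sample Frank-Wolfe update with a SAG-style averaged gradient, here over the probability simplex and for a smooth finite sum only; the paper's composite non-smooth term and exact estimator are not reproduced.

```python
import numpy as np

def sag_frank_wolfe(grad_fns, x0, iters=2000, seed=0):
    """One gradient sample per iteration, SAG-style running average, FW step on the simplex."""
    rng = np.random.default_rng(seed)
    n, x = len(grad_fns), x0.copy()
    table = [np.zeros_like(x0) for _ in range(n)]  # stored per-sample gradients
    avg = np.zeros_like(x0)                        # running average of the table
    for t in range(1, iters + 1):
        j = rng.integers(n)                        # draw a single sample index
        g = grad_fns[j](x)
        avg += (g - table[j]) / n                  # update the averaged gradient estimate
        table[j] = g
        s = np.zeros_like(x)
        s[np.argmin(avg)] = 1.0                    # linear minimization oracle on the simplex
        x = x + (2.0 / (t + 2.0)) * (s - x)        # classical Frank-Wolfe step size
    return x

# Toy usage: a least-squares finite sum minimized over the probability simplex.
A, b = np.random.randn(50, 10), np.random.randn(50)
grads = [lambda x, a=A[i], bi=b[i]: 2.0 * a * (a @ x - bi) for i in range(50)]
x_hat = sag_frank_wolfe(grads, np.ones(10) / 10)
```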
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
- An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances with up to 800 data points.
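As a generic illustration of how such pairwise constraints can be encoded before optimization, the sketch below collapses must-link components into super-points with a union-find structure and checks cannot-link feasibility; it is a preprocessing step, not the paper's branch-and-bound algorithm.

```python
def preprocess_constraints(n, must_link, cannot_link):
    """Collapse must-link components into super-points; raise if a cannot-link pair collapses."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]          # path compression
            i = parent[i]
        return i

    for a, b in must_link:                         # merge must-link components
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in cannot_link:                       # feasibility check on cannot-link pairs
        if find(a) == find(b):
            raise ValueError(f"infeasible constraints between points {a} and {b}")

    roots = sorted({find(i) for i in range(n)})
    relabel = {r: c for c, r in enumerate(roots)}
    return [relabel[find(i)] for i in range(n)]    # super-point id for every original point

# Example: points 0, 1, 2 must be together; 0 and 3 must be apart.
print(preprocess_constraints(5, must_link=[(0, 1), (1, 2)], cannot_link=[(0, 3)]))
```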
arXiv Detail & Related papers (2021-11-30T17:08:53Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
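A generic PCA-then-$k$-means pipeline in the spirit of this summary (project onto the leading principal components, then cluster); the number of components and the use of scikit-learn are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_kmeans(X, k, n_components=None):
    """Project onto the leading principal components, then run k-means on the scores."""
    n_components = n_components or k               # a common default: k components
    scores = PCA(n_components=n_components).fit_transform(X)
    return KMeans(n_clusters=k, n_init=10).fit_predict(scores)

# Toy usage on a two-component Gaussian mixture in high dimension.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 200)), rng.normal(3, 1, (100, 200))])
labels = pca_kmeans(X, k=2)
```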
arXiv Detail & Related papers (2020-05-21T17:51:30Z)