Distribution free optimality intervals for clustering
- URL: http://arxiv.org/abs/2107.14442v1
- Date: Fri, 30 Jul 2021 06:13:56 GMT
- Title: Distribution free optimality intervals for clustering
- Authors: Marina Meil\u{a}, Hanyu Zhang
- Abstract summary: Given data $mathcalD$ and a partition $mathcalC$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data?
This paper introduces a paradigm in which a clustering $mathcalC$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations.
- Score: 1.7513645771137178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of validating the ouput of clustering algorithms.
Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$
clusters, when can we say that the clusters obtained are correct or meaningful
for the data? This paper introduces a paradigm in which a clustering
$\mathcal{C}$ is considered meaningful if it is good with respect to a loss
function such as the K-means distortion, and stable, i.e. the only good
clustering up to small perturbations. Furthermore, we present a generic method
to obtain post-inference guarantees of near-optimality and stability for a
clustering $\mathcal{C}$. The method can be instantiated for a variety of
clustering criteria (also called loss functions) for which convex relaxations
exist. Obtaining the guarantees amounts to solving a convex optimization
problem. We demonstrate the practical relevance of this method by obtaining
guarantees for the K-means and the Normalized Cut clustering criteria on
realistic data sets. We also prove that asymptotic instability implies finite
sample instability w.h.p., allowing inferences about the population
clusterability from a sample. The guarantees do not depend on any
distributional assumptions, but they depend on the data set $\mathcal{D}$
admitting a stable clustering.
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - A Unified Framework for Center-based Clustering of Distributed Data [46.86543102499174]
We develop a family of distributed center-based clustering algorithms that work over networks of users.
Our framework allows for a broad class of smooth convex loss functions, including popular clustering losses like $K$-means and Huber loss.
For the special case of Bregman losses, we show that our fixed points converge to the set of Lloyd points.
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - Are Easy Data Easy (for K-Means) [0.0]
This paper investigates the capability of correctly recovering well-separated clusters by various brands of the $k$-means algorithm.
A new algorithm is proposed that is a variation of $k$-means++ via repeated subsampling when choosing a seed.
arXiv Detail & Related papers (2023-08-02T09:40:19Z) - Sketch-and-solve approaches to k-means clustering by semidefinite
programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated we identify the k-means optimal clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z) - Asymptotics for The $k$-means [0.6091702876917281]
The $k$-means is one of the most important unsupervised learning techniques in statistics and computer science.
The proposed clustering consistency is more appropriate than the previous criterion consistency for the clustering methods.
It is found that the proposed $k$-means method has lower clustering error rates and is more robust to small clusters and outliers.
arXiv Detail & Related papers (2022-11-18T03:36:58Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
One-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds [17.226362076527764]
Centroid based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis.
In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset.
We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by centroids.
arXiv Detail & Related papers (2022-03-04T18:59:02Z) - Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model textitdoes not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.