K*-Means: A Parameter-free Clustering Algorithm
- URL: http://arxiv.org/abs/2505.11904v1
- Date: Sat, 17 May 2025 08:41:07 GMT
- Title: K*-Means: A Parameter-free Clustering Algorithm
- Authors: Louis Mahon, Mirella Lapata,
- Abstract summary: k*-means is a novel clustering algorithm that eliminates the need to set k or any other parameters.<n>It uses the minimum description length principle to automatically determine the optimal number of clusters, k*, by splitting and merging clusters.<n>We prove that k*-means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where k is unknown.
- Score: 55.20132267309382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is a widely used and powerful machine learning technique, but its effectiveness is often limited by the need to specify the number of clusters, k, or by relying on thresholds that implicitly determine k. We introduce k*-means, a novel clustering algorithm that eliminates the need to set k or any other parameters. Instead, it uses the minimum description length principle to automatically determine the optimal number of clusters, k*, by splitting and merging clusters while also optimising the standard k-means objective. We prove that k*-means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where k is unknown. We also show that it is accurate in estimating k, and that empirically its runtime is competitive with existing methods, and scales well with dataset size.
Related papers
- Radius-Guided Post-Clustering for Shape-Aware, Scalable Refinement of k-Means Results [1.9580473532948401]
After standard k-means, each cluster center is assigned a radius (the distance to its assigned point), and clusters whose radii overlap are merged.<n>This post-processing step loosens the requirement for exact k long as k is.<n>The method can often reconstruct non-estimated shapes over meaningful merges.
arXiv Detail & Related papers (2025-04-28T22:30:53Z) - Counterfactual Explanations for k-means and Gaussian Clustering [1.8561812622368767]
We present a general definition for counterfactuals for model-based clustering that includes plausibility and feasibility constraints.<n>Our approach takes as input the factual, the target cluster, a binary mask indicating actionable or immutable features and a plausibility factor specifying how far from the cluster boundary the counterfactual should be placed.
arXiv Detail & Related papers (2025-01-17T14:56:20Z) - Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct distance matrix between data points by Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Sketch-and-solve approaches to k-means clustering by semidefinite
programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated we identify the k-means optimal clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
One-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect
the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z) - Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - Too Much Information Kills Information: A Clustering Perspective [6.375668163098171]
We propose a simple, but novel approach for variance-based k-clustering tasks, including in which is the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability.
arXiv Detail & Related papers (2020-09-16T01:54:26Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.