No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds
- URL: http://arxiv.org/abs/2203.02502v1
- Date: Fri, 4 Mar 2022 18:59:02 GMT
- Title: No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds
- Authors: Ahmed Imtiaz Humayun, Randall Balestriero, Anastasios Kyrillidis,
Richard Baraniuk
- Abstract summary: Centroid-based clustering methods such as k-means, k-medoids, and k-centers are heavily used as go-to tools in exploratory data analysis.
In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset.
We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by centroids.
- Score: 17.226362076527764
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Centroid-based clustering methods such as k-means, k-medoids, and
k-centers are heavily used as go-to tools in exploratory data analysis. In many
cases, those methods are used to obtain representative centroids of the data
manifold for visualization or summarization of a dataset. Real-world datasets
often contain inherent abnormalities, e.g., repeated samples and sampling bias,
that manifest as imbalanced clustering. We propose to remedy such a scenario by
introducing a maximal radius constraint $r$ on the clusters formed by the
centroids, i.e., samples from the same cluster should not be more than $2r$
apart in terms of $\ell_2$ distance. We achieve this constraint by solving a
semi-definite program, followed by a linear assignment problem with quadratic
constraints. Through qualitative results, we show that our proposed method is
robust to dataset imbalances and sampling artifacts. To the best of our
knowledge, ours is the first constrained k-means clustering method with hard
radius constraints. Codes at https://bit.ly/kmeans-constrained
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The one-step K-means dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
- Approximating a RUM from Distributions on k-Slates [88.32814292632675]
We give an efficient algorithm that finds the RUM that best approximates the given distribution on average.
Our theoretical result can also be made practical: we obtain an algorithm that is effective and scales to real-world datasets.
arXiv Detail & Related papers (2023-05-22T17:43:34Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
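
As a rough illustration of the filtering idea, a Butterworth-style low-pass response can turn pairwise distances into similarities; the `cutoff` and `order` parameters below are assumptions, and the paper's actual construction may differ.

```python
import numpy as np


def butterworth_similarity(X, cutoff=1.0, order=2):
    """Hedged sketch: map pairwise distances to similarities with a
    Butterworth-style low-pass response 1 / (1 + (d / cutoff)^(2*order)).
    Small distances yield similarity near 1; large ones decay to 0."""
    sq = np.sum(X * X, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    D = np.sqrt(D2)
    return 1.0 / (1.0 + (D / cutoff) ** (2 * order))
```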
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Sketch-and-solve approaches to k-means clustering by semidefinite programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data are appropriately separated, we identify the optimal k-means clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
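
A hedged sketch of the sketch-and-solve pattern: cluster a uniform subsample, then extend the solution to all points. Plain Lloyd's iterations (via scikit-learn) stand in here for the Peng-Wei SDP that the paper actually solves on the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans


def sketch_and_solve_kmeans(X, k, m, seed=0):
    """Cluster a size-m uniform subsample (the 'sketch'), then assign
    every point to its nearest sketched centroid. The paper solves the
    Peng-Wei SDP on the sketch; KMeans is an illustrative stand-in."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    labels = km.predict(X)  # extend the sketched solution to all data
    return km.cluster_centers_, labels
```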
arXiv Detail & Related papers (2022-11-28T19:51:30Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
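
A minimal sketch of that pipeline, assuming the patch embeddings have already been extracted by some pretrained network; the extractor, and the paper's patch-weighting scheme, are omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_anomalies(patch_embeddings, n_types, seed=0):
    """Average patch-level embeddings from a pretrained network into one
    vector per image, then apply an off-the-shelf clustering method.
    `patch_embeddings` (shape: images x patches x dim) is assumed given."""
    image_vecs = np.stack([p.mean(axis=0) for p in patch_embeddings])
    km = KMeans(n_clusters=n_types, n_init=10, random_state=seed)
    return km.fit_predict(image_vecs)  # one anomaly-type label per image
```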
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Distribution free optimality intervals for clustering [1.7513645771137178]
Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data?
This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e., the only good clustering up to small perturbations.
arXiv Detail & Related papers (2021-07-30T06:13:56Z)
- (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose the use of a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
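
For intuition about the distance being clustered, here is the classic *discrete* DTW dynamic program; the paper's CDTW is a continuous variant and is not reproduced here.

```python
import numpy as np


def dtw(a, b):
    """Classic discrete DTW between two trajectories (arrays of points).
    D[i, j] holds the best alignment cost of a[:i] and b[:j]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```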
arXiv Detail & Related papers (2020-12-01T13:17:27Z)
- Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis [14.048989759890475]
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center.
We find a formula for the average performance distortion in the regime where the number of cluster centers is large.
In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
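
One way to read this setup: minimizing $\sum_{l} \|x_l - c\|^2$ over a single $d$-dimensional center $c$ reduces, up to an additive constant, to quantizing the mean of the $L$ views. A hedged sketch under that reading, with k-means standing in for the quantizer design:

```python
import numpy as np
from sklearn.cluster import KMeans


def common_center_quantizer(X, n_centers, seed=0):
    """X has shape (n, L, d): each sample is L concatenated d-dimensional
    noisy views. Quantizing the Ld-dimensional sample to one d-dimensional
    center under sum_l ||x_l - c||^2 is equivalent (up to a constant) to
    quantizing the per-sample mean of the L views."""
    means = X.mean(axis=1)  # (n, d) per-sample means across the L views
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=seed).fit(means)
    return km.cluster_centers_, km.labels_
```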
arXiv Detail & Related papers (2020-10-23T17:14:28Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
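
A hedged sketch of a PCA-then-cluster step of this flavor; the paper's algorithm and its treatment of sparsity differ in detail.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


def pca_cluster(X, k, seed=0):
    """Project the data onto the top k-1 principal components, then run
    k-means in the projected space. An illustrative stand-in, not the
    paper's exact procedure."""
    Z = PCA(n_components=max(k - 1, 1)).fit_transform(X)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
```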
arXiv Detail & Related papers (2020-05-21T17:51:30Z)