No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds
- URL: http://arxiv.org/abs/2203.02502v1
- Date: Fri, 4 Mar 2022 18:59:02 GMT
- Title: No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds
- Authors: Ahmed Imtiaz Humayun, Randall Balestriero, Anastasios Kyrillidis,
Richard Baraniuk
- Abstract summary: Centroid based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis.
In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset.
We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by centroids.
- Score: 17.226362076527764
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Centroid based clustering methods such as k-means, k-medoids and k-centers
are heavily applied as a go-to tool in exploratory data analysis. In many
cases, those methods are used to obtain representative centroids of the data
manifold for visualization or summarization of a dataset. Real world datasets
often contain inherent abnormalities, e.g., repeated samples and sampling bias,
that manifest imbalanced clustering. We propose to remedy such a scenario by
introducing a maximal radius constraint $r$ on the clusters formed by the
centroids, i.e., samples from the same cluster should not be more than $2r$
apart in terms of $\ell_2$ distance. We achieve this constraint by solving a
semi-definite program, followed by a linear assignment problem with quadratic
constraints. Through qualitative results, we show that our proposed method is
robust towards dataset imbalances and sampling artifacts. To the best of our
knowledge, ours is the first constrained k-means clustering method with hard
radius constraints. Code at https://bit.ly/kmeans-constrained
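As a minimal illustration of the hard radius constraint described in the abstract (not the authors' SDP-plus-assignment solver), the sketch below assigns each sample to its nearest centroid and flags samples that violate $\|x - c\|_2 \le r$; by the triangle inequality, any two feasible samples in the same cluster are at most $2r$ apart. The data, centroids, and function name are illustrative.

```python
import numpy as np

def radius_feasible_assignment(X, centroids, r):
    """Assign each sample to its nearest centroid and flag samples that
    violate the hard radius constraint ||x - c||_2 <= r.  Feasible samples
    sharing a cluster are at most 2r apart by the triangle inequality."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    feasible = d[np.arange(len(X)), labels] <= r
    return labels, feasible

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
labels, feasible = radius_feasible_assignment(X, C, r=1.0)
# the outlier at (10, 10) is nearest to centroid 1 but lies outside radius r
```

The paper instead enforces feasibility for all samples during optimization; this toy check only reports which samples a fixed set of centroids can cover within radius $r$.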
Related papers
- Fuzzy K-Means Clustering without Cluster Centroids [79.19713746387337]
Fuzzy K-Means clustering is a critical computation technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
- A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
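A local, single-machine sketch of the gradient-clustering fixed-point idea this entry refers to, assuming the plain $K$-means loss (illustrative only, not the distributed DGC scheme): one gradient step moves each center toward the mean of its currently assigned samples, so centers that already equal those means are fixed points.

```python
import numpy as np

def kmeans_gradient_step(X, centers, lr=0.5):
    """One gradient step on the K-means loss: each center moves a fraction
    lr of the way toward the mean of its currently assigned samples."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members):
            new[k] += lr * (members.mean(axis=0) - centers[k])
    return new

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
# centers already sit at their cluster means, so a gradient step is a fixed point
stepped = kmeans_gradient_step(X, centers)
```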
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - A general theory for robust clustering via trimmed mean [7.650319416775203]
We introduce a hybrid clustering technique with a novel trimmed mean type centroid estimate to produce mislabeling guarantees.
Our results reduce to the sub-Gaussian case when errors follow sub-Gaussian distributions.
We show that these initial centroid estimates are sufficiently good for the subsequent clustering algorithm to achieve the optimal mislabeling rates.
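A toy version of a trimmed-mean-type centroid estimate (a simplification for illustration; the paper's estimator and its mislabeling guarantees are more involved): discard the fraction of points farthest from the plain mean, then re-average. Even this crude trim resists a gross outlier that would drag the ordinary mean far off.

```python
import numpy as np

def trimmed_mean_centroid(points, trim_frac=0.2):
    """Centroid estimate that drops the trim_frac fraction of points
    farthest from the plain mean, then averages the rest."""
    mean = points.mean(axis=0)
    dists = np.linalg.norm(points - mean, axis=1)
    keep = dists.argsort()[: int(np.ceil((1 - trim_frac) * len(points)))]
    return points[keep].mean(axis=0)

inliers = np.zeros((8, 2))            # eight points at the origin
outlier = np.array([[100.0, 100.0]])  # one gross outlier
pts = np.vstack([inliers, outlier])
est = trimmed_mean_centroid(pts, trim_frac=0.2)
# the plain mean is pulled to roughly (11.1, 11.1); the trimmed estimate stays at the origin
```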
arXiv Detail & Related papers (2024-01-10T22:56:44Z) - Approximating a RUM from Distributions on k-Slates [88.32814292632675]
We give an algorithm that finds the RUM that best approximates the given distribution on average.
Our theoretical result can also be made practical: we obtain an algorithm that is effective and scales to real-world datasets.
arXiv Detail & Related papers (2023-05-22T17:43:34Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Sketch-and-solve approaches to k-means clustering by semidefinite programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated, we identify the k-means optimal clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z)
- Distribution free optimality intervals for clustering [1.7513645771137178]
Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data?
This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations.
arXiv Detail & Related papers (2021-07-30T06:13:56Z)
- (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose using a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
arXiv Detail & Related papers (2020-12-01T13:17:27Z)
- Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis [14.048989759890475]
We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center.
We find a formula for the average performance distortion in the regime where the number of cluster centers is large.
In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
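A toy illustration of this entry's setup, with hypothetical names and data: $L = 5$ noisy $d$-dimensional views of one source are averaged and then quantized to a single $d$-dimensional cluster center, rather than quantizing the concatenated $Ld$-dimensional vector. The paper's asymptotic distortion analysis is not reproduced here.

```python
import numpy as np

codebook = np.array([[0.0], [1.0], [2.0]])   # d = 1 cluster centers

def quantize(x, centers):
    """Map x to its nearest center (the common cluster center)."""
    return centers[np.argmin(np.linalg.norm(centers - x, axis=1))]

source = np.array([1.0])                      # true d-dimensional source
noise = np.array([[0.3], [-0.2], [0.1], [-0.1], [0.2]])
views = source + noise                        # L = 5 noisy observations
# average the L views first, then quantize to one d-dimensional center
common = quantize(views.mean(axis=0), codebook)
# the averaged view (1.06) snaps to the true source's center, 1.0
```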
arXiv Detail & Related papers (2020-10-23T17:14:28Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\|_\infty \to \infty$.
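A minimal sketch of PCA-based two-cluster splitting (illustrative only, not the paper's algorithm or its minimax analysis): center the data, project onto the top principal direction, and split by the sign of the projection.

```python
import numpy as np

def pca_cluster(X):
    """Two-cluster split via PCA: project centered data onto the top
    principal direction and split by the sign of the projection."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]
    return (proj > 0).astype(int)

X = np.array([[-3.0, 0.1], [-3.1, -0.1], [3.0, 0.0], [3.2, 0.1]])
labels = pca_cluster(X)
# the two left points share one label, the two right points the other
```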
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.