K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means
- URL: http://arxiv.org/abs/2002.03899v2
- Date: Wed, 19 Aug 2020 16:56:21 GMT
- Title: K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means
- Authors: Camille Brunet-Saumard, Edouard Genetay, Adrien Saumard
- Abstract summary: We propose a new clustering algorithm that is robust to the presence of outliers in the dataset.
We build on the idea of median-of-means statistics to estimate the centroids, but allow for replacement while constructing the blocks.
We prove its robustness to adversarial contamination by deriving robust rates of convergence for the K-means distortion.
- Score: 3.222802562733787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new clustering algorithm that is robust to the presence of
outliers in the dataset. We perform Lloyd-type iterations with robust estimates
of the centroids. More precisely, we build on the idea of median-of-means
statistics to estimate the centroids, but allow for replacement while
constructing the blocks. We call this methodology the bootstrap median-of-means
(bMOM) and prove that if enough blocks are generated through the bootstrap
sampling, then it has a better breakdown point for mean estimation than the
classical median-of-means (MOM), where the blocks form a partition of the
dataset. From a clustering perspective, bMOM makes it possible to draw many
blocks of a desired size, thus avoiding the possible disappearance of clusters
in some blocks, a pitfall that can occur with the partition-based generation of
blocks in the classical median-of-means. Experiments on simulated datasets show
that the proposed approach, called K-bMOM, performs better than existing robust
K-means-based methods. Guidelines are provided for tuning the hyper-parameters
of K-bMOM in practice. We also recommend that practitioners use such a robust
approach to initialize their clustering algorithm. Finally, considering a
simplified and theoretical version of our estimator, we prove its robustness to
adversarial contamination by deriving robust rates of convergence for the
K-means distortion. To our knowledge, it is the first result of this kind for
the K-means distortion.
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The one-step dimensionality-reduction K-means clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Fuzzy K-Means Clustering without Cluster Centroids [21.256564324236333]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
- Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means [18.3248037914529]
We present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies.
Statistical guarantees, in the form of an upper bound on the clustering error, suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
arXiv Detail & Related papers (2023-11-26T19:01:15Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Sketch-and-solve approaches to k-means clustering by semidefinite programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering (a hedged code sketch of this relaxation appears after this list).
If the data is appropriately separated, we identify the optimal k-means clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for its limitations is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- Clustering by the Probability Distributions from Extreme Value Theory [32.496691290725764]
This paper generalizes k-means to model the distribution of clusters.
We use the Generalized Pareto Distribution (GPD) to establish a probability model for each cluster.
We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means.
Notably, GEV k-means can also estimate the cluster structure and thus performs reasonably well compared to classical k-means.
arXiv Detail & Related papers (2022-02-20T10:52:43Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL)
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPPs) for the random restart of clustering algorithms.
DPPs favor diversity among the center points within subsets.
We show through simulations that, contrary to DPP sampling, uniform random selection of centers fails both to ensure diversity and to obtain good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
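
For readers curious about the sketch-and-solve entry above, the following minimal cvxpy sketch shows the two ingredients its summary names: the Peng-Wei semidefinite relaxation of k-means, whose optimum lower-bounds the k-means objective on the points it is given, and the "sketch" step of solving that relaxation on a random subsample. Turning the subsampled value into a high-confidence bound for the full dataset requires the paper's own analysis, which is not reproduced here; the function names and the subsample size `m` are illustrative, not from the paper.

```python
import numpy as np
import cvxpy as cp

def peng_wei_sdp_value(X, k):
    """Peng-Wei SDP relaxation of k-means on the rows of X:
    minimize <D, Z>/2  subject to  Z >= 0 entrywise, Z PSD,
    Z @ 1 = 1, trace(Z) = k,  with D[i, j] = ||x_i - x_j||^2.
    Its optimum lower-bounds the k-means objective on X."""
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T  # pairwise squared distances
    n = len(X)
    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0, cp.sum(Z, axis=1) == 1, cp.trace(Z) == k]
    prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(D, Z)) / 2), constraints)
    prob.solve()  # uses cvxpy's default SDP-capable solver (e.g. SCS)
    return prob.value

def sketched_sdp_value(X, k, m=100, seed=0):
    """'Sketch' step: solve the SDP on a small random subsample only,
    which is far cheaper than solving it on all of X."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=min(m, len(X)), replace=False)]
    return peng_wei_sdp_value(S, k)
```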
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.