Clustering by the Probability Distributions from Extreme Value Theory
- URL: http://arxiv.org/abs/2202.09784v1
- Date: Sun, 20 Feb 2022 10:52:43 GMT
- Title: Clustering by the Probability Distributions from Extreme Value Theory
- Authors: Sixiao Zheng, Ke Fan, Yanxi Hou, Jianfeng Feng, and Yanwei Fu
- Abstract summary: This paper generalizes k-means to model the distribution of clusters.
We use the Generalized Pareto Distribution (GPD) to establish a probability model for each cluster.
We also introduce a naive baseline, dubbed Generalized Extreme Value (GEV) k-means.
Notably, GEV k-means can also estimate the cluster structure and thus performs reasonably well, improving over classical k-means.
- Score: 32.496691290725764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is an essential task in unsupervised learning. It tries to
automatically separate instances into coherent subsets. As one of the most
well-known clustering algorithms, k-means assigns sample points at the boundary
to a unique cluster, while ignoring the information carried by the sample
distribution or density. By comparison, it can be more beneficial to consider
the probability of each sample belonging to a candidate cluster. To this end,
this paper generalizes k-means to model the distribution of clusters. Our novel
clustering algorithm models the distribution of distances to centroids above a
threshold with the Generalized Pareto Distribution (GPD) from Extreme Value
Theory (EVT). Notably, we propose the concept of centroid margin distance, use
the GPD to establish a probability model for each cluster, and perform
clustering based on the covering probability function derived from the GPD.
The resulting GPD k-means thus casts clustering in a probabilistic perspective.
Correspondingly, we also introduce a naive baseline, dubbed Generalized Extreme
Value (GEV) k-means, in which the GEV distribution is fit to block maxima. In
contrast, the GPD fits the distribution of distances to the centroid that
exceed a sufficiently large threshold, which makes GPD k-means more stable.
Notably, GEV k-means can also estimate the cluster structure and thus performs
reasonably well, improving over classical k-means. Extensive experiments on
synthetic and real datasets demonstrate that GPD k-means outperforms competing
methods. The code is released at
https://github.com/sixiaozheng/EVT-K-means.
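As a rough illustration of the probabilistic view described above, here is a minimal Python sketch that fits a GPD (scipy.stats.genpareto) to each cluster's distance exceedances and reassigns points by a peaks-over-threshold (POT) survival probability. The 0.7 threshold quantile, the POT estimator, and the Lloyd-style loop are assumptions for illustration, not the paper's exact covering probability function or centroid margin distance.

```python
# Illustrative sketch only: each cluster models its distance tail with a GPD
# and points are reassigned by a peaks-over-threshold survival probability.
import numpy as np
from scipy.stats import genpareto

def tail_prob(d_cluster, d_query, q=0.7):
    """Estimate P(D > d): empirical below the threshold u, GPD above it."""
    u = np.quantile(d_cluster, q)
    c, _, scale = genpareto.fit(d_cluster[d_cluster > u] - u, floc=0.0)
    p_u = np.mean(d_cluster > u)                      # empirical P(D > u)
    emp = np.array([np.mean(d_cluster > t) for t in d_query])
    gpd = p_u * genpareto.sf(d_query - u, c, loc=0.0, scale=scale)
    return np.where(d_query > u, gpd, emp)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]

for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                     # nearest-centroid warm start
    # Each cluster scores every point by how un-extreme its distance is under
    # that cluster's tail model; assign each point to its highest score.
    scores = np.column_stack(
        [tail_prob(dists[labels == j, j], dists[:, j]) for j in range(k)])
    new_labels = scores.argmax(axis=1)
    new_centroids = np.vstack([X[new_labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

The GEV baseline would analogously fit scipy.stats.genextreme to block maxima of the distances rather than to threshold exceedances.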
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality-reduction clustering method has made progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
- UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters [2.4953699842881605]
We focus on the concept of unimodality and propose a flexible cluster definition, the locally unimodal cluster.
A locally unimodal cluster extends for as long as unimodality is locally preserved across pairs of subclusters of the data.
We propose the UniForCE method for locally unimodal clustering.
arXiv Detail & Related papers (2023-12-18T16:19:02Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm that directly detects clusters without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter (one possible form is sketched below).
To exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
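A guess at how a Butterworth-style transfer function could turn pairwise distances into similarities; the cutoff and order below are hypothetical parameters, not values from the paper.

```python
# Hypothetical sketch: pass pairwise distances through a Butterworth-style
# low-pass response |H(d)| = 1 / sqrt(1 + (d / cutoff)^(2*order)), so nearby
# points keep similarity near 1 and far points decay smoothly.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def butterworth_similarity(X, cutoff, order=4):
    D = squareform(pdist(X))               # pairwise Euclidean distances
    return 1.0 / np.sqrt(1.0 + (D / cutoff) ** (2 * order))

X = np.random.default_rng(0).normal(size=(100, 5))
S = butterworth_similarity(X, cutoff=np.median(pdist(X)))
```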
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees (a minimal sketch of the pattern appears below).
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
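A minimal sketch of that one-shot pattern, assuming each user uploads a single local estimate and the server clusters the uploads with k-means before averaging within clusters; the user count, noise level, and aggregation rule are all illustrative.

```python
# Minimal sketch of the one-shot pattern: each user sends one local estimate,
# and the server clusters the uploads and averages per cluster.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
true_models = np.array([[0.0, 0.0], [4.0, 4.0]])   # two latent user clusters
membership = rng.integers(0, 2, size=50)           # hidden cluster per user
uploads = true_models[membership] + rng.normal(0, 0.3, (50, 2))  # one round

centers, assign = kmeans2(uploads, 2, minit='++', seed=0)   # server-side step
cluster_models = np.vstack(
    [uploads[assign == j].mean(axis=0) for j in range(2)])  # per-cluster model
```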
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Wasserstein $K$-means for clustering probability distributions [16.153709556346417]
In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent.
In modern machine learning applications, data often arise as probability distributions, and a natural generalization to handle measure-valued data is to use the optimal transport metric.
We show that the SDP-relaxed Wasserstein $K$-means achieves exact recovery when the clusters are well separated under the $2$-Wasserstein metric.
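A toy sketch of K-means over 1-D empirical distributions under the Wasserstein metric. scipy.stats.wasserstein_distance only covers the 1-D case, and Wasserstein barycenters are nontrivial, so the centroid update below is a medoid stand-in rather than the paper's SDP formulation.

```python
# Toy sketch: K-means over 1-D empirical distributions with the W1 metric,
# using a medoid update instead of a true Wasserstein barycenter.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
data = [rng.normal(m, 1, 200) for m in [0] * 20 + [6] * 20]  # 40 distributions
k = 2
medoids = [data[i] for i in rng.choice(len(data), k, replace=False)]

for _ in range(10):
    d = np.array([[wasserstein_distance(x, m) for m in medoids] for x in data])
    labels = d.argmin(axis=1)
    for j in range(k):
        members = [i for i in range(len(data)) if labels[i] == j]
        if not members:
            continue
        # Medoid update: member minimizing total W1 distance to its cluster.
        costs = [sum(wasserstein_distance(data[i], data[m]) for m in members)
                 for i in members]
        medoids[j] = data[members[int(np.argmin(costs))]]
```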
arXiv Detail & Related papers (2022-09-14T23:43:16Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPPs) for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, unlike DPPs, uniform random selection of center points fails both to ensure diversity and to obtain good coverage of all data facets (a DPP-style selection is sketched below).
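For intuition, a greedy log-determinant selection is a cheap approximation to DPP sampling of diverse restart centers; exact k-DPP sampling is more involved, and the RBF kernel and bandwidth here are assumptions.

```python
# Illustrative greedy log-determinant selection, approximating DPP sampling
# of diverse restart centers for a clustering algorithm.
import numpy as np
from scipy.spatial.distance import cdist

def greedy_dpp_centers(X, k, bandwidth=1.0):
    L = np.exp(-cdist(X, X) ** 2 / (2 * bandwidth ** 2))  # similarity kernel
    selected = [0]                # RBF diagonal is constant, so start anywhere
    while len(selected) < k:
        best, best_val = -1, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:   # bigger volume = more diverse
                best, best_val = i, logdet
        selected.append(best)
    return X[np.array(selected)]

X = np.random.default_rng(0).normal(size=(200, 2))
centers = greedy_dpp_centers(X, k=5)     # diverse k-means restart seeds
```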
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Kernel learning approaches for summarising and combining posterior similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices.
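A minimal sketch of that PSD observation: the PSM entry (i, j) is the fraction of MCMC draws in which items i and j share a cluster; each draw's co-clustering matrix is PSD, so their average is PSD too and can serve as a kernel matrix. The MCMC labels below are simulated placeholders.

```python
# Build a posterior similarity matrix from (simulated) MCMC cluster labels
# and verify it is positive semi-definite, hence usable as a kernel.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(500, 30))     # 500 draws x 30 items

psm = np.mean(labels[:, :, None] == labels[:, None, :], axis=0)
assert np.linalg.eigvalsh(psm).min() >= -1e-8   # PSD up to numerical noise
```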
arXiv Detail & Related papers (2020-09-27T14:16:14Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Statistical power for cluster analysis [0.0]
Clustering algorithms are increasingly popular in biomedical research.
We estimate power and accuracy for common analyses through simulation.
We recommend that researchers only apply cluster analysis when large subgroup separation is expected.
arXiv Detail & Related papers (2020-03-01T02:43:15Z)
- K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means [3.222802562733787]
We propose a new clustering algorithm that is robust to the presence of outliers in the dataset.
We build on the idea of median-of-means statistics to estimate the centroids, but allow for replacement while constructing the blocks (a rough sketch of this idea follows below).
We prove its robustness to adversarial contamination by deriving robust rates of convergence for the K-means distortion.
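A rough sketch of the median-of-means idea for robust centroid updates; the block count, block size, and update rule are simplifications for illustration, not K-bMOM's exact algorithm.

```python
# Score bootstrap blocks by their k-means distortion, keep the median block,
# and recompute centroids from it, so outlier-heavy blocks are ignored.
import numpy as np

def kbmom_step(X, centroids, rng, n_blocks=11, block_size=50):
    k = len(centroids)
    # Bootstrap blocks drawn WITH replacement, as the summary describes.
    blocks = [X[rng.choice(len(X), block_size, replace=True)]
              for _ in range(n_blocks)]
    def distortion(B):                   # mean squared nearest-centroid distance
        d = np.linalg.norm(B[:, None, :] - centroids[None, :, :], axis=2)
        return (d.min(axis=1) ** 2).mean()
    order = np.argsort([distortion(B) for B in blocks])
    med = blocks[order[n_blocks // 2]]   # the median-distortion block
    # Lloyd update restricted to the median block.
    d = np.linalg.norm(med[:, None, :] - centroids[None, :, :], axis=2)
    lab = d.argmin(axis=1)
    return np.vstack([med[lab == j].mean(axis=0) if np.any(lab == j)
                      else centroids[j] for j in range(k)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (145, 2)), rng.normal(6, 1, (145, 2))])
X = np.vstack([X, rng.normal(50, 1, (10, 2))])       # gross outliers
C = X[rng.choice(len(X), 2, replace=False)]
for _ in range(15):
    C = kbmom_step(X, C, rng)
```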
arXiv Detail & Related papers (2020-02-10T16:08:08Z)