Selecting the number of clusters, clustering models, and algorithms. A
unifying approach based on the quadratic discriminant score
- URL: http://arxiv.org/abs/2111.02302v3
- Date: Fri, 11 Aug 2023 15:02:43 GMT
- Title: Selecting the number of clusters, clustering models, and algorithms. A
unifying approach based on the quadratic discriminant score
- Authors: Luca Coraggio and Pietro Coretto
- Abstract summary: We propose a selection rule that allows choosing among many clustering solutions.
The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods.
- Score: 0.5330240017302619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cluster analysis requires many decisions: the clustering method and the
implied reference model, the number of clusters and, often, several
hyper-parameters and algorithms' tunings. In practice, one produces several
partitions, and a final one is chosen based on validation or selection
criteria. There exists an abundance of validation methods that, implicitly or
explicitly, assume a certain clustering notion. Moreover, they are often
restricted to operate on partitions obtained from a specific method. In this
paper, we focus on groups that can be well separated by quadratic or linear
boundaries. The reference cluster concept is defined through the quadratic
discriminant score function and parameters describing clusters' size, center
and scatter. We develop two cluster-quality criteria called quadratic scores.
We show that these criteria are consistent with groups generated from a general
class of elliptically symmetric distributions. The quest for this type of
grouping is common in applications. The connection with likelihood theory for
mixture models and model-based clustering is investigated. Based on bootstrap
resampling of the quadratic scores, we propose a selection rule that allows
choosing among many clustering solutions. The proposed method has the
distinctive advantage that it can compare partitions that cannot be compared
with other state-of-the-art methods. Extensive numerical experiments and the
analysis of real data show that, even if some competing methods turn out to be
superior in some setups, the proposed methodology achieves a better overall
performance.
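The following is a minimal, illustrative sketch (not the authors' implementation) of the core quantities above: the quadratic discriminant score built from clusters' size, center and scatter, a hard quadratic score for a partition, and a simplified bootstrap ranking of candidate partitions. All function names here are hypothetical, and the paper's actual hard/smooth quadratic scores and selection rule are defined more carefully.

```python
# Illustrative sketch only: the exact hard/smooth quadratic scores and the
# bootstrap selection rule are defined in the paper; names here are made up.
import numpy as np

def quadratic_discriminant_scores(X, pi, mu, Sigma):
    """delta_k(x) = log pi_k - 0.5 log|Sigma_k| - 0.5 (x-mu_k)' Sigma_k^{-1} (x-mu_k).
    Returns an (n, K) matrix of scores."""
    n, K = X.shape[0], len(pi)
    out = np.empty((n, K))
    for k in range(K):
        diff = X - mu[k]                                  # (n, p)
        inv = np.linalg.inv(Sigma[k])
        _, logdet = np.linalg.slogdet(Sigma[k])
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)  # Mahalanobis terms
        out[:, k] = np.log(pi[k]) - 0.5 * logdet - 0.5 * maha
    return out

def hard_quadratic_score(X, labels):
    """Estimate size/center/scatter per cluster, then average each point's
    discriminant score at its own cluster."""
    ks = np.unique(labels)
    pi = np.array([np.mean(labels == k) for k in ks])
    mu = [X[labels == k].mean(axis=0) for k in ks]
    Sigma = [np.cov(X[labels == k], rowvar=False) for k in ks]
    delta = quadratic_discriminant_scores(X, pi, mu, Sigma)
    pos = np.searchsorted(ks, labels)                     # column of own cluster
    return delta[np.arange(len(X)), pos].mean()

def select_partition(X, candidate_labels, B=100, seed=0):
    """Rank candidate partitions (arrays of labels) by the bootstrap mean of
    the hard score, a simplification of the paper's bootstrap-based rule;
    assumes every cluster keeps enough points in each resample for a stable
    covariance estimate."""
    rng = np.random.default_rng(seed)
    means = []
    for labels in candidate_labels:
        idx = rng.integers(0, len(X), size=(B, len(X)))   # bootstrap indices
        means.append(np.mean([hard_quadratic_score(X[i], labels[i]) for i in idx]))
    return int(np.argmax(means))
```

Because the score depends only on the partition and the fitted size/center/scatter parameters, a ranking of this kind can compare partitions produced by entirely different clustering methods, which is the advantage the abstract highlights.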
Related papers
- A Computational Theory and Semi-Supervised Algorithm for Clustering [0.0]
A semi-supervised clustering algorithm is presented.
The kernel of the clustering method is Mohammad's anomaly detection algorithm.
Results are presented on synthetic and real-world data sets.
arXiv Detail & Related papers (2023-06-12T09:15:58Z)
- High-dimensional variable clustering based on maxima of a weakly dependent random process [1.1999555634662633]
We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models.
This class of models is identifiable, meaning that there exists a maximal element under a partial order between partitions, allowing for statistical inference.
We also present an algorithm depending on a tuning parameter that recovers the clusters of variables without specifying the number of clusters a priori.
arXiv Detail & Related papers (2023-02-02T08:24:26Z)
- A parallelizable model-based approach for marginal and multivariate clustering [0.0]
This paper develops a clustering method that takes advantage of the sturdiness of model-based clustering.
A finite mixture model is specified per margin, allowing each margin to have a different number of clusters.
The proposed approach is computationally appealing and more tractable for moderate to high dimensions than a 'full' (joint) model-based clustering approach (a minimal sketch of the per-margin idea appears after this list).
arXiv Detail & Related papers (2022-12-07T23:54:41Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- clusterBMA: Bayesian model averaging for clustering [1.2021605201770345]
We introduce clusterBMA, a method that enables weighted model averaging across results from unsupervised clustering algorithms.
We use clustering internal validation criteria to develop an approximation of the posterior model probability, used for weighting the results from each model.
In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features, including probabilistic allocation to averaged clusters (a weighting sketch appears after this list).
arXiv Detail & Related papers (2022-09-09T04:55:20Z)
- Personalized Federated Learning via Convex Clustering [72.15857783681658]
We propose a family of algorithms for personalized federated learning with locally convex user costs.
The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized.
arXiv Detail & Related papers (2022-02-01T19:25:31Z)
- Selective Inference for Hierarchical Clustering [2.3311605203774386]
We propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method.
Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data.
arXiv Detail & Related papers (2020-12-05T03:03:19Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximate tests work effectively compared to the naive test, which does not take the selective bias into account.
arXiv Detail & Related papers (2020-05-27T10:44:19Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
On a set of 16 data tables generated by a quasi-Monte Carlo experiment, one of the aggregation criteria, using the L1 dissimilarity, is compared against hierarchical clustering and a k-means variant, partitioning around medoids (PAM); a PAM sketch appears after this list.
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
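As referenced above, a minimal sketch of the per-margin idea from "A parallelizable model-based approach for marginal and multivariate clustering": fit a univariate Gaussian mixture to each margin and select that margin's number of components by BIC. The paper's actual model and estimation procedure differ; this only illustrates letting each margin have its own number of clusters.

```python
# Illustrative per-margin mixture fitting; k_max and the BIC criterion are
# assumptions for this sketch, not the paper's specification.
import numpy as np
from sklearn.mixture import GaussianMixture

def per_margin_clusters(X, k_max=5, seed=0):
    """For each column of X, return the BIC-selected number of mixture
    components and the fitted labels for that margin."""
    results = []
    for j in range(X.shape[1]):
        col = X[:, [j]]                      # univariate margin as (n, 1)
        fits = [GaussianMixture(n_components=k, random_state=seed).fit(col)
                for k in range(1, k_max + 1)]
        best = min(fits, key=lambda m: m.bic(col))   # lower BIC = better
        results.append((best.n_components, best.predict(col)))
    return results
```

Each margin is fit independently, which is what makes the approach naturally parallelizable across columns.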
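A hedged sketch of the weighting idea in clusterBMA: soft allocation matrices from several clustering models are averaged with weights derived from an internal validation criterion. The softmax normalisation and the pre-aligned allocations are assumptions made for illustration; the paper instead derives an approximation of the posterior model probability.

```python
# Illustrative model averaging for clustering; assumes the allocation
# matrices are already aligned to a common labeling of K clusters.
import numpy as np

def weighted_average_allocation(allocations, validation_scores):
    """allocations: list of (n, K) soft allocation matrices;
    validation_scores: one internal validation value per model (higher = better)."""
    s = np.asarray(validation_scores, dtype=float)
    w = np.exp(s - s.max())
    w /= w.sum()                              # softmax-style model weights
    return sum(wi * A for wi, A in zip(w, allocations))
```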
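Finally, a short sketch of the PAM baseline from "Clustering Binary Data by Application of Combinatorial Optimization Heuristics", using k-medoids with the L1 (Manhattan) dissimilarity via the scikit-learn-extra package; the data and parameters are placeholders, and the paper's metaheuristics themselves are not reproduced here.

```python
# PAM (k-medoids) with L1 dissimilarity on a synthetic binary table.
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10)).astype(float)    # binary data table
pam = KMedoids(n_clusters=3, metric="manhattan", method="pam",
               random_state=0).fit(X)
print(np.bincount(pam.labels_))                         # cluster sizes
```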
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.