On the Use of Relative Validity Indices for Comparing Clustering Approaches
- URL: http://arxiv.org/abs/2404.10351v1
- Date: Tue, 16 Apr 2024 07:39:54 GMT
- Title: On the Use of Relative Validity Indices for Comparing Clustering Approaches
- Authors: Luke W. Yerbury, Ricardo J. G. B. Campello, G. C. Livingston Jr, Mark Goldsworthy, Lachlan O'Neil,
- Abstract summary: Relative Validity Indices (RVIs) are popular tools for evaluating and optimising applications of clustering.
Many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches.
In this study, we conducted experiments with seven common RVIs on over 2.7 million clustering partitions for both synthetic and real-world datasets.
- Score: 0.6990493129893111
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relative Validity Indices (RVIs) such as the Silhouette Width Criterion, Calinski-Harabasz and Davie's Bouldin indices are the most popular tools for evaluating and optimising applications of clustering. Their ability to rank collections of candidate partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. Beyond these more conventional tasks, many examples can be found in the literature where RVIs have been used to compare and select other aspects of clustering approaches such as data normalisation procedures, data representation methods, and distance measures. The authors are not aware of any studies that have attempted to establish the suitability of RVIs for such comparisons. Moreover, given the impact of these aspects on pairwise similarities, it is not even immediately obvious how RVIs should be implemented when comparing these aspects. In this study, we conducted experiments with seven common RVIs on over 2.7 million clustering partitions for both synthetic and real-world datasets, encompassing feature-vector and time-series data. Our findings suggest that RVIs are not well-suited to these unconventional tasks, and that conclusions drawn from such applications may be misleading. It is recommended that normalisation procedures, representation methods, and distance measures instead be selected using external validation on high quality labelled datasets or carefully designed outcome-oriented objective criteria, both of which should be informed by relevant domain knowledge and clustering aims.
Related papers
- A Bayesian cluster validity index [0.0]
Cluster validity indices (CVIs) are designed to identify the optimal number of clusters within a dataset.
We introduce a Bayesian cluster validity index (BCVI) which builds upon existing indices.
Our BCVI offers clear advantages in situations where user expertise is valuable, allowing users to specify their desired range for the final number of clusters.
arXiv Detail & Related papers (2024-02-03T14:23:36Z) - In Search of Insights, Not Magic Bullets: Towards Demystification of the
Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z) - Robust Consensus Clustering and its Applications for Advertising
Forecasting [18.242055675730253]
We propose a novel algorithm -- robust consensus clustering that can find common ground truth among experts' opinions.
We apply the proposed method to the real-world advertising campaign segmentation and forecasting tasks.
arXiv Detail & Related papers (2022-12-27T21:49:04Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
One-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As results show, the proposed strategies perform better than the classification based on observed data and allow to keep a high accuracy even when the missing data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Clustering performance analysis using new correlation based cluster
validity indices [0.0]
We develop two new cluster validity indices based on a correlation between an actual distance between a pair of data points and a centroid distance of clusters that the two points locate in.
Our proposed indices constantly yield several peaks at different numbers of clusters which overcome the weakness previously stated.
arXiv Detail & Related papers (2021-09-23T06:59:41Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, which is then applied to the clustering task and we come up with the Graph Constrastive Clustering(GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z) - Near-Optimal Comparison Based Clustering [7.930242839366938]
We show that our method can recover a planted clustering using a near-optimal number of comparisons.
We empirically validate our theoretical findings and demonstrate the good behaviour of our method on real data.
arXiv Detail & Related papers (2020-10-08T12:03:13Z) - Clustering Binary Data by Application of Combinatorial Optimization
Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z) - On clustering uncertain and structured data with Wasserstein barycenters
and a geodesic criterion for the number of clusters [0.0]
This work considers the notion of Wasserstein barycenters, accompanied by appropriate clustering indices based on the intrinsic geometry of the Wasserstein space where the clustering task is performed.
Such type of clustering approaches are highly appreciated in many fields where the observational/experimental error is significant.
Under this perspective, each observation is identified by an appropriate probability measure and the proposed clustering schemes rely on discrimination criteria.
arXiv Detail & Related papers (2019-12-26T08:46:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.