Normalised clustering accuracy: An asymmetric external cluster validity measure
- URL: http://arxiv.org/abs/2209.02935v4
- Date: Thu, 25 Jul 2024 14:31:03 GMT
- Title: Normalised clustering accuracy: An asymmetric external cluster validity measure
- Authors: Marek Gagolewski,
- Abstract summary: Clustering algorithms are traditionally evaluated using either internal or external validity measures.
In this paper, we argue that the commonly used classical partition similarity scores miss some desirable properties.
We propose and analyse a new measure: a version of the optimal set-matching accuracy.
- Score: 2.900810893770134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
Related papers
- Can an unsupervised clustering algorithm reproduce a categorization system? [1.0485739694839669]
We investigate whether unsupervised clustering can reproduce ground truth classes in a labeled dataset.
We show that success depends on feature selection and the chosen distance metric.
arXiv Detail & Related papers (2024-08-19T18:27:14Z) - ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: it allows items to have associated importance values that are application-specific, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in the clustering quality is novel: instead of trying to construct an expensive ground truth up front and evaluating the each clustering with respect to that, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy"
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly
Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using a patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z) - Selecting the number of clusters, clustering models, and algorithms. A
unifying approach based on the quadratic discriminant score [0.5330240017302619]
We propose a selection rule that allows choosing among many clustering solutions.
The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods.
arXiv Detail & Related papers (2021-11-03T15:38:58Z) - J-Score: A Robust Measure of Clustering Accuracy [8.33909555155795]
Clustering analysis discovers hidden structures in a data set by partitioning them into disjoint clusters.
Current clustering accuracy measures include overlooking unmatched clusters, biases towards excessive clusters, unstable baselines, and difficult interpretation.
We present a novel accuracy measure, J-score, that addresses these issues.
arXiv Detail & Related papers (2021-09-03T04:43:52Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z) - Selecting the Number of Clusters $K$ with a Stability Trade-off: an
Internal Validation Criterion [0.0]
Clustering stability has emerged as a natural and model-agnostic principle.
We propose a new principle: a good clustering should be stable, and within each cluster, there should exist no stable partition.
arXiv Detail & Related papers (2020-06-15T16:38:48Z) - Clustering Binary Data by Application of Combinatorial Optimization
Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.