Sanity Check for External Clustering Validation Benchmarks using
Internal Validation Measures
- URL: http://arxiv.org/abs/2209.10042v1
- Date: Tue, 20 Sep 2022 23:32:18 GMT
- Title: Sanity Check for External Clustering Validation Benchmarks using
Internal Validation Measures
- Authors: Hyeon Jeon, Michael Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park,
Jinwook Seo
- Abstract summary: We address the lack of reliability in benchmarking clustering techniques based on labeled datasets.
We propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets.
- Score: 8.808021343665319
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address the lack of reliability in benchmarking clustering techniques
based on labeled datasets. A standard scheme in external clustering validation
is to use class labels as ground truth clusters, based on the assumption that
each class forms a single, clearly separated cluster. However, this
cluster-label matching (CLM) assumption often breaks, and the absence of a
sanity check on the CLM of benchmark datasets casts doubt on the validity of
external validations. Still, evaluating the degree of CLM is challenging. For
example, internal clustering validation measures can be used to quantify CLM
within the same dataset to evaluate its different clusterings but are not
designed to compare clusterings of different datasets. In this work, we propose
a principled way to generate between-dataset internal measures that enable the
comparison of CLM across datasets. We first determine four axioms for
between-dataset internal measures, complementing Ackerman and Ben-David's
within-dataset axioms. We then propose processes to generalize internal
measures to fulfill these new axioms, and use them to extend the widely used
Calinski-Harabasz index for between-dataset CLM evaluation. Through
quantitative experiments, we (1) verify the validity and necessity of the
generalization processes and (2) show that the proposed between-dataset
Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we
demonstrate the importance of evaluating CLM of benchmark datasets before
conducting external validation.
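As a concrete illustration of the kind of sanity check argued for above, the snippet below computes the standard within-dataset Calinski-Harabasz index with class labels treated as ground-truth clusters. This is only the classic scikit-learn index on an arbitrarily chosen benchmark, not the between-dataset generalization proposed in the paper; as the paper points out, such raw within-dataset scores are not directly comparable across datasets.

# Minimal sketch: within-dataset Calinski-Harabasz index as a rough CLM check.
# This is NOT the paper's between-dataset measure; the dataset choice is illustrative.
from sklearn.datasets import load_wine
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)       # class labels used as ground-truth clusters
X = StandardScaler().fit_transform(X)

ch = calinski_harabasz_score(X, y)      # higher = classes look more like separated clusters
print(f"Calinski-Harabasz index for the class labeling: {ch:.2f}")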
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- From A-to-Z Review of Clustering Validation Indices [4.08908337437878]
We review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms.
We suggest a classification framework for examining the functionality of both internal and external clustering validation measures.
arXiv Detail & Related papers (2024-07-18T13:52:02Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Clustering Validation with The Area Under Precision-Recall Curves [0.0]
Clustering Validation Indices (CVIs) allow for clustering validation in real application scenarios.
We show that measures based on the area under precision-recall curves are not only appropriate as CVIs, but should also be preferred in the presence of cluster imbalance.
We perform a comprehensive evaluation of the proposed and state-of-the-art CVIs on real and simulated data sets.
arXiv Detail & Related papers (2023-04-04T01:49:57Z)
- CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)
- reval: a Python package to determine best clustering solutions with stability-based relative clustering validation [1.8129328638036126]
reval is a Python package that leverages stability-based relative clustering validation methods to determine best clustering solutions.
This work develops a stability-based method that selects the best clustering solution as the one that best replicates, via supervised learning, on unseen subsets of the data (a generic sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-08-27T10:36:56Z)
- Evaluating and Validating Cluster Results [0.0]
In this paper, both external evaluation and internal evaluation are performed on the cluster results of the IRIS dataset.
For internal performance measures, the Silhouette Index and Sum of Square Errors are used.
Finally, frequency distributions are used as a statistical tool to compare and visually represent the distribution of observations within a clustering result and within the original data.
arXiv Detail & Related papers (2020-07-15T23:14:48Z)
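As a minimal sketch of the measures named in the last entry above ("Evaluating and Validating Cluster Results"), the snippet below computes the Silhouette Index and the Sum of Square Errors as internal measures, plus the Adjusted Rand Index as an external measure, on the IRIS dataset; the k-means setup and parameters are illustrative rather than the paper's exact procedure.

# Internal (Silhouette, SSE) and external (ARI) evaluation of a k-means result on IRIS.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Silhouette Index    :", round(silhouette_score(X, km.labels_), 3))      # internal
print("Sum of Square Errors:", round(km.inertia_, 2))                          # internal (k-means SSE)
print("Adjusted Rand Index :", round(adjusted_rand_score(y, km.labels_), 3))   # external, vs. class labels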
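The stability-based relative validation idea behind the reval entry above can also be sketched generically. The code below is not the reval package's API; it only illustrates the core idea under assumed choices (k-means clustering, a k-NN classifier to transfer labels, and the Adjusted Rand Index as the agreement score): a number of clusters is considered good when a classifier trained on the training-split clustering reproduces the clustering found independently on a held-out split.

# Generic stability-based relative validation sketch (NOT the reval package's API).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, _ = load_iris(return_X_y=True)
X_tr, X_te = train_test_split(X, test_size=0.5, random_state=0)

for k in range(2, 7):
    train_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_tr)
    test_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_te)
    # Transfer the training clustering to the held-out split via supervised learning.
    predicted = KNeighborsClassifier(n_neighbors=5).fit(X_tr, train_labels).predict(X_te)
    stability = adjusted_rand_score(test_labels, predicted)  # agreement on unseen data
    print(f"k={k}: stability (ARI) = {stability:.3f}")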