Measuring the Validity of Clustering Validation Datasets
- URL: http://arxiv.org/abs/2503.01097v1
- Date: Mon, 03 Mar 2025 01:54:04 GMT
- Title: Measuring the Validity of Clustering Validation Datasets
- Authors: Hyeon Jeon, Michaël Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park, Jinwook Seo
- Abstract summary: Internal validation measures (IVMs) can compare cluster-label matching (CLM) over different labeling of the same dataset, but are not designed to do so across different datasets. We introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets.
- Score: 9.451764507106027
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM over different labeling of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation.
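As a rough illustration of the within-dataset case described in the abstract, the sketch below computes the standard Silhouette coefficient for two labelings of the same toy dataset. It is not the paper's adjusted IVM and does not address comparison across datasets. The function name `silhouette`, the toy `points`, and the two labelings are assumptions made up for this example.

```python
import math


def silhouette(points, labels):
    """Mean standard Silhouette coefficient (Euclidean distance).

    Hypothetical helper for illustration; not the adjusted IVM from the paper.
    """
    # Group point indices by label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (assumed): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
aligned = [0, 0, 0, 0, 1, 1, 1, 1]     # labels follow the two groups
misaligned = [0, 1, 0, 1, 0, 1, 0, 1]  # labels cut across the groups

print("aligned:", silhouette(points, aligned))
print("misaligned:", silhouette(points, misaligned))
```

The labeling that follows the two groups scores much higher than the one that cuts across them, which is the sense in which an IVM can rank labelings of one dataset. Raw scores from different datasets are not directly comparable in this way, which is the gap the abstract says the adjusted IVMs target.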
Related papers
- Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM.
Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
- Can an unsupervised clustering algorithm reproduce a categorization system? [1.0485739694839669]
We investigate whether unsupervised clustering can reproduce ground truth classes in a labeled dataset.
We show that success depends on feature selection and the chosen distance metric.
arXiv Detail & Related papers (2024-08-19T18:27:14Z)
- DREW: Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Twin Contrastive Learning for Online Clustering [15.9794051341163]
This paper proposes to perform online clustering by conducting twin contrastive learning (TCL) at the instance and cluster level.
We find that when the data is projected into a feature space with a dimensionality of the target cluster number, the rows and columns of its feature matrix correspond to the instance and cluster representation.
arXiv Detail & Related papers (2022-10-21T02:12:48Z)
- Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures [8.808021343665319]
We address the lack of reliability in benchmarking clustering techniques based on labeled datasets.
We propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets.
arXiv Detail & Related papers (2022-09-20T23:32:18Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)