Clustering Validation with The Area Under Precision-Recall Curves
- URL: http://arxiv.org/abs/2304.01450v1
- Date: Tue, 4 Apr 2023 01:49:57 GMT
- Title: Clustering Validation with The Area Under Precision-Recall Curves
- Authors: Pablo Andretta Jaskowiak and Ivan Gesteira Costa
- Abstract summary: An internal/relative Clustering Validation Index (CVI) allows for clustering validation in real application scenarios.
We show that these are not only appropriate as CVIs, but should also be preferred in the presence of cluster imbalance.
We perform a comprehensive evaluation of the proposed and state-of-the-art CVIs on real and simulated data sets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Confusion matrices and derived metrics provide a comprehensive framework for
the evaluation of model performance in machine learning. These are well-known
and extensively employed in the supervised learning domain, particularly
classification. Surprisingly, such a framework has not been fully explored in
the context of clustering validation. Indeed, just recently such a gap has been
bridged with the introduction of the Area Under the ROC Curve for Clustering
(AUCC), an internal/relative Clustering Validation Index (CVI) that allows for
clustering validation in real application scenarios. In this work we explore
the Area Under Precision-Recall Curve (and related metrics) in the context of
clustering validation. We show that these are not only appropriate as CVIs, but
should also be preferred in the presence of cluster imbalance. We perform a
comprehensive evaluation of the proposed and state-of-the-art CVIs on real and
simulated data sets. Our observations support a unified validation framework
for supervised and unsupervised learning, as they are consistent with existing
guidelines established for the evaluation of supervised learning models.
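To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of using the area under the precision-recall curve as a CVI: every pair of points is scored by similarity, labelled positive when both points fall in the same cluster, and the curve is integrated over that ranking. All function and variable names are illustrative.

```python
import math
from itertools import combinations

def pairwise_pr_auc(points, labels):
    """Area under the precision-recall curve computed over point pairs.

    Each pair is scored by similarity (negative Euclidean distance) and
    labelled positive if both points share a cluster. A good clustering
    ranks same-cluster pairs ahead of cross-cluster ones.
    """
    pairs = []
    for i, j in combinations(range(len(points)), 2):
        score = -math.dist(points[i], points[j])
        pairs.append((score, labels[i] == labels[j]))
    # Rank pairs from most to least similar.
    pairs.sort(key=lambda t: t[0], reverse=True)

    n_pos = sum(1 for _, same in pairs if same)
    if n_pos == 0:  # no same-cluster pair to rank
        return 0.0
    tp, auprc, prev_recall = 0, 0.0, 0.0
    for k, (_, same) in enumerate(pairs, start=1):
        if same:
            tp += 1
        precision, recall = tp / k, tp / n_pos
        auprc += precision * (recall - prev_recall)  # step-wise integration
        prev_recall = recall
    return auprc
```

A well-separated clustering ranks every same-cluster pair ahead of every cross-cluster pair and reaches an area of 1.0; the imbalanced-cluster regime is where the paper argues precision-recall behaves better than its ROC counterpart.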
Related papers
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
- Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures [8.808021343665319]
We address the lack of reliability in benchmarking clustering techniques based on labeled datasets.
We propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets.
arXiv Detail & Related papers (2022-09-20T23:32:18Z)
- Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
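The CL measure can be sketched in a few lines. The version below is a hypothetical, dependency-free reduction (plain Lloyd's k-means plus leave-one-out 1-NN) rather than the authors' exact protocol; all names are illustrative.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels

def cluster_learnability(points, k=2):
    """CL sketch: cluster the representations with k-means, then score a
    leave-one-out 1-NN classifier on predicting those cluster labels."""
    labels = kmeans(points, k)
    correct = 0
    for i, p in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(points)
```

The intuition: if the clusters found by k-means are easy for a simple non-parametric classifier to reproduce, the representation has learnable structure.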
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
- ExpertNet: A Symbiosis of Classification and Clustering [22.324813752423044]
ExpertNet uses novel training strategies to learn clustered latent representations and leverage them by effectively combining cluster-specific classifiers.
We demonstrate the superiority of ExpertNet over state-of-the-art methods on 6 large clinical datasets.
arXiv Detail & Related papers (2022-01-17T11:00:30Z)
- Deep Conditional Gaussian Mixture Model for Constrained Clustering [7.070883800886882]
Constrained clustering can leverage prior information on a growing amount of only partially labeled data.
We propose a novel framework for constrained clustering that is intuitive, interpretable, and can be trained efficiently in the framework of gradient variational inference.
arXiv Detail & Related papers (2021-06-11T13:38:09Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Learning to Generate Fair Clusters from Demonstrations [27.423983748614198]
We show how to identify the intended fairness constraint for a problem based on limited demonstrations from an expert.
We present an algorithm to identify the fairness metric from demonstrations and generate clusters using existing off-the-shelf clustering techniques.
We investigate how to generate interpretable solutions using our approach.
arXiv Detail & Related papers (2021-02-08T03:09:33Z)
- Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering (H-SRDC), which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)
- The Area Under the ROC Curve as a Measure of Clustering Quality [0.0]
Area Under the Curve for Clustering (AUCC) is an internal/relative measure of clustering quality.
AUCC is a linear transformation of the Gamma criterion from Baker and Hubert (1975).
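A minimal sketch of the AUCC idea, assuming the Mann-Whitney formulation of ROC AUC: AUCC is the probability that a randomly chosen same-cluster pair of points is more similar than a randomly chosen cross-cluster pair. Names are illustrative, not the authors' code.

```python
import math
from itertools import combinations

def aucc(points, labels):
    """AUCC sketch via the Mann-Whitney statistic: the fraction of
    (same-cluster, cross-cluster) pair comparisons in which the
    same-cluster pair is the more similar one (ties count half)."""
    pos, neg = [], []
    for i, j in combinations(range(len(points)), 2):
        sim = -math.dist(points[i], points[j])  # similarity score
        (pos if labels[i] == labels[j] else neg).append(sim)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A clustering that keeps every same-cluster pair closer than every cross-cluster pair scores exactly 1.0, matching the ROC-AUC interpretation used for supervised classifiers.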
arXiv Detail & Related papers (2020-09-04T21:34:51Z)
- Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and an adaptive-neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
- Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation [138.29273453811945]
We present Self-Ensembling with Category-agnostic Clusters (SE-CC) -- a novel architecture that steers domain adaptation with category-agnostic clusters in target domain.
Clustering is performed over all the unlabeled target samples to obtain the category-agnostic clusters, which reveal the underlying data space structure peculiar to the target domain.
arXiv Detail & Related papers (2020-06-11T16:19:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.