A testing-based approach to assess the clusterability of categorical
data
- URL: http://arxiv.org/abs/2307.07346v1
- Date: Fri, 14 Jul 2023 13:50:00 GMT
- Title: A testing-based approach to assess the clusterability of categorical
data
- Authors: Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He
- Abstract summary: TestCat is a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value.
We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms solutions based on existing clusterability evaluation methods for numeric data.
- Score: 6.7937877930001775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of clusterability evaluation is to check whether a clustering
structure exists within the data set. As a crucial yet often-overlooked issue
in cluster analysis, it is essential to conduct such a test before applying any
clustering algorithm. If a data set is unclusterable, any subsequent clustering
analysis would not yield valid results. Despite its importance, the majority of
existing studies focus on numerical data, leaving the clusterability evaluation
issue for categorical data as an open problem. Here we present TestCat, a
testing-based approach to assess the clusterability of categorical data in
terms of an analytical $p$-value. The key idea underlying TestCat is that
clusterable categorical data possess many strongly correlated attribute pairs
and hence the sum of chi-squared statistics of all attribute pairs is employed
as the test statistic for $p$-value calculation. We apply our method to a set
of benchmark categorical data sets, showing that TestCat outperforms those
solutions based on existing clusterability evaluation methods for numeric data.
To the best of our knowledge, our work provides the first way to effectively
recognize the clusterability of categorical data in a statistically sound
manner.
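The test statistic described in the abstract — the sum of chi-squared statistics over all attribute pairs — can be sketched as follows. This is a hypothetical illustration, not the authors' TestCat implementation: the function name and toy data are assumptions, and the pooled-degrees-of-freedom p-value below is a naive stand-in (it treats the pair statistics as independent, which does not hold in general) for the analytical p-value the paper derives.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def pairwise_chi2_sum(X):
    """Sum the chi-squared statistics of all attribute pairs of a
    categorical data matrix X (rows = objects, columns = attributes)."""
    _, m = X.shape
    total_stat, total_dof = 0.0, 0
    for i in range(m):
        for j in range(i + 1, m):
            # Build the contingency table for attribute pair (i, j).
            cats_i, codes_i = np.unique(X[:, i], return_inverse=True)
            cats_j, codes_j = np.unique(X[:, j], return_inverse=True)
            table = np.zeros((len(cats_i), len(cats_j)))
            np.add.at(table, (codes_i, codes_j), 1)
            stat, _, dof, _ = chi2_contingency(table)
            total_stat += stat
            total_dof += dof
    # Naive p-value: treat the sum as chi-squared with the pooled degrees
    # of freedom. This independence assumption is a simplification; the
    # paper derives a proper analytical p-value for the sum instead.
    return total_stat, total_dof, chi2.sf(total_stat, total_dof)

rng = np.random.default_rng(0)
# Clusterable toy data: four binary attributes that mostly copy a latent
# two-group label, so every attribute pair is strongly correlated.
g = rng.integers(0, 2, size=300)
X = np.column_stack([np.where(rng.random(300) < 0.1, 1 - g, g)
                     for _ in range(4)])
stat, dof, p = pairwise_chi2_sum(X)  # p is effectively zero here
```

On such clusterable data the pairwise statistics are large and the combined p-value is tiny, whereas independently generated attributes would yield a p-value with no evidence against the null of unclusterability.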
Related papers
- Order Is All You Need for Categorical Data Clustering [29.264630563297466]
Categorical data composed of nominal valued attributes are ubiquitous in knowledge discovery and data mining tasks.
Due to the lack of well-defined metric space, categorical data distributions are difficult to intuitively understand.
This paper introduces the new finding that the order relation among attribute values is the decisive factor in clustering accuracy.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
- ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: it allows items to have associated importance values that are application-specific, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of trying to construct an expensive ground truth up front and evaluating each clustering with respect to it, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - From A-to-Z Review of Clustering Validation Indices [4.08908337437878]
We review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms.
We suggest a classification framework for examining the functionality of both internal and external clustering validation measures.
arXiv Detail & Related papers (2024-07-18T13:52:02Z)
- Interpretable Clustering with the Distinguishability Criterion [0.4419843514606336]
We present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations.
We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures.
We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications.
arXiv Detail & Related papers (2024-04-24T16:38:15Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Significance-Based Categorical Data Clustering [7.421725101465365]
We use the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering.
A new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure.
arXiv Detail & Related papers (2022-11-08T02:06:31Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Evaluating and Validating Cluster Results [0.0]
In this paper, both external evaluation and internal evaluation are performed on the cluster results of the IRIS dataset.
For internal performance measures, the Silhouette Index and Sum of Square Errors are used.
Finally, as a statistical tool, we used the frequency distribution method to compare the distribution of observations within a clustering result against the original data and to provide a visual representation of both.
arXiv Detail & Related papers (2020-07-15T23:14:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.