Comparative analysis of unsupervised clustering techniques using validation metrics: Study on cognitive features from the Canadian Longitudinal Study on Aging (CLSA)
- URL: http://arxiv.org/abs/2504.12270v1
- Date: Mon, 07 Apr 2025 21:13:51 GMT
- Title: Comparative analysis of unsupervised clustering techniques using validation metrics: Study on cognitive features from the Canadian Longitudinal Study on Aging (CLSA)
- Authors: ChenNingZhi Sheng, Rafal Kustra, Davide Chicco
- Abstract summary: The CLSA dataset includes 18,891 participants with data available at both baseline and follow-up assessments. The clustering methodologies employed in this analysis are K-means (KM) clustering, Hierarchical Clustering (HC), and Partitioning Around Medoids (PAM). Using evaluation metrics to compare the results of the three clustering techniques, K-means and Partitioning Around Medoids (PAM) produced similar results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Purpose: The primary goal of this study is to explore the application of evaluation metrics to different clustering algorithms using the data provided by the Canadian Longitudinal Study on Aging (CLSA), focusing on cognitive features. The objective of our work is to discover potentially clinically relevant clusters that contribute to the development of dementia over time, based on cognitive changes. Method: The CLSA dataset includes 18,891 participants with data available at both baseline and follow-up assessments, to which the clustering algorithms were applied. The clustering methodologies employed in this analysis are K-means (KM) clustering, Hierarchical Clustering (HC), and Partitioning Around Medoids (PAM). We use multiple evaluation metrics to assess our analysis. As internal evaluation metrics, we use: Average Silhouette Width, Within-to-Between Sum of Squares Ratio (WB.Ratio), Entropy, Calinski-Harabasz Index (CH Index), and Separation Index. As clustering comparison metrics, we use: Homogeneity, Completeness, Adjusted Rand Index (ARI), Rand Index (RI), and Variation of Information. Results: Using the evaluation metrics to compare the results of the three clustering techniques, K-means and Partitioning Around Medoids (PAM) produced similar results. In contrast, there are significant differences between K-means clustering and Hierarchical Clustering. Our study highlights the importance of two internal evaluation metrics: entropy and separation index. Among the clustering comparison metrics, the Adjusted Rand Index is a key tool. Conclusion: The study results have the potential to contribute to the understanding of dementia. Researchers can also benefit by applying the suggested evaluation metrics to other areas of healthcare research. Overall, our study improves the understanding of how clustering techniques and evaluation metrics can reveal complex patterns in medical data.
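The abstract's pipeline of three clustering methods scored by internal and comparison metrics can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn and synthetic data in place of the CLSA cognitive features, and since scikit-learn has no PAM, a simplified k-medoids loop stands in for it.

```python
# Sketch: run KM, HC, and a PAM-style k-medoids on the same data, then score
# them with an internal metric (silhouette, CH index) and a comparison metric (ARI).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             adjusted_rand_score)

def pam_like(X, k, n_iter=20, seed=0):
    """Simplified k-medoids (Voronoi iteration), illustrative stand-in for PAM."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                             # re-pick each medoid as the
            members = np.where(labels == j)[0]         # member minimizing total
            if len(members):                           # within-cluster distance
                costs = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # stand-in data
labels = {
    "KM": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "HC": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "PAM": pam_like(X, k=3),
}
for name, lab in labels.items():                      # internal evaluation metrics
    print(name, round(silhouette_score(X, lab), 3),
          round(calinski_harabasz_score(X, lab), 1))
# Clustering comparison metric: agreement between two partitions
print("ARI(KM, PAM):", round(adjusted_rand_score(labels["KM"], labels["PAM"]), 3))
```

On well-separated synthetic blobs, KM and the k-medoids stand-in typically agree almost perfectly (ARI near 1), mirroring the abstract's finding that KM and PAM produce similar results.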
Related papers
- Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient [0.5939858158928473]
This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering.
Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC.
arXiv Detail & Related papers (2025-01-26T14:29:11Z) - Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: it allows items to have associated importance values that are application-specific, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of trying to construct an expensive ground truth up front and evaluating each clustering with respect to that, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - From A-to-Z Review of Clustering Validation Indices [4.08908337437878]
We review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms.
We suggest a classification framework for examining the functionality of both internal and external clustering validation measures.
arXiv Detail & Related papers (2024-07-18T13:52:02Z) - A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z) - CLAMS: A Cluster Ambiguity Measure for Estimating Perceptual Variability in Visual Clustering [23.625877882403227]
We study perceptual variability in conducting visual clustering, which we call Cluster Ambiguity.
We introduce CLAMS, a data-driven visual quality measure for automatically predicting cluster ambiguity in monochrome scatterplots.
arXiv Detail & Related papers (2023-08-01T04:46:35Z) - A Self-Supervised Learning-based Approach to Clustering Multivariate Time-Series Data with Missing Values (SLAC-Time): An Application to TBI Phenotyping [8.487912181381404]
We present a Self-supervised Learning-based Approach to Clustering multivariate Time-series data with missing values (SLAC-Time)
SLAC-Time is a Transformer-based clustering method that uses time-series forecasting as a proxy task for leveraging unlabeled data.
Experiments show that SLAC-Time outperforms the baseline K-means clustering algorithm in terms of silhouette coefficient, Calinski-Harabasz index, Dunn index, and Davies-Bouldin index.
arXiv Detail & Related papers (2023-02-27T01:05:17Z) - Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
arXiv Detail & Related papers (2022-06-02T19:05:13Z) - Performance evaluation results of evolutionary clustering algorithm star for clustering heterogeneous datasets [15.154538450706474]
This article presents the data used to evaluate the performance of the evolutionary clustering algorithm star (ECA*).
Two experimental methods are employed to examine the performance of ECA* against five traditional and modern clustering algorithms.
arXiv Detail & Related papers (2021-04-30T08:17:19Z) - Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure.
We find clinically relevant clusters from an embedded space derived from heterogeneous data.
The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
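Several entries above turn clustering evaluation into a supervised proxy task; the Cluster Learnability (CL) measure, for instance, trains a KNN to predict K-means pseudo-labels. A minimal sketch of that idea, assuming scikit-learn and synthetic blobs as a stand-in for learned representations:

```python
# Sketch of Cluster Learnability (CL): cluster the representations with
# K-means, then score how well a KNN classifier can recover those labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for representations produced by a self-supervised model
Z, _ = make_blobs(n_samples=400, centers=4, random_state=1)

# Pseudo-labels from K-means define the partition to be "learned"
pseudo_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(Z)

# Cross-validated KNN accuracy on the pseudo-labels: higher = more learnable
knn = KNeighborsClassifier(n_neighbors=5)
cl_score = cross_val_score(knn, Z, pseudo_labels, cv=5).mean()
print(round(cl_score, 3))
```

Because K-means labels follow Voronoi regions that a KNN approximates well, the score is high on structured representations; the measure becomes informative when comparing representations of varying quality.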
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.