Cluster Metric Sensitivity to Irrelevant Features
- URL: http://arxiv.org/abs/2402.12008v1
- Date: Mon, 19 Feb 2024 10:02:00 GMT
- Title: Cluster Metric Sensitivity to Irrelevant Features
- Authors: Miles McCrory and Spencer A. Thomas
- Abstract summary: We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways.
Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to added irrelevant features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Clustering algorithms are used extensively in data analysis for data
exploration and discovery. Technological advancements lead to continual
growth of data in terms of volume, dimensionality and complexity. This provides
great opportunities in data analytics as the data can be interrogated for many
different purposes. This, however, leads to challenges such as the identification of
relevant features for a given task. In supervised tasks, one can utilise a
number of methods to optimise the input features for the task objective (e.g.
classification accuracy). In unsupervised problems, such tools are not readily
available, in part due to an inability to quantify feature relevance in
unlabeled tasks. In this paper, we investigate the sensitivity of clustering
performance to noisy, uncorrelated variables iteratively added to baseline
datasets with well-defined clusters. We show how different types of irrelevant variables
can impact the outcome of a clustering result from $k$-means in different ways.
We observe a resilience to very high proportions of irrelevant features for
adjusted Rand index (ARI) and normalised mutual information (NMI) when the
irrelevant features are Gaussian distributed. For uniformly distributed
irrelevant features, we observe that the resilience of ARI and NMI depends on
the dimensionality of the data and exhibits tipping points between high scores
and near zero. Our results show that the Silhouette Coefficient and the
Davies-Bouldin score are the most sensitive to added irrelevant features,
exhibiting large changes in score at comparably low proportions of irrelevant
features, regardless of the underlying distribution or data scaling. As such, the
Silhouette Coefficient and the Davies-Bouldin score are good candidates for
optimising feature selection in unsupervised clustering tasks.
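The experimental protocol described above can be sketched in a few lines of scikit-learn. The snippet below is an illustrative reconstruction, not the paper's exact configuration: the blob dataset, the number of clusters ($k=4$), the noise scale, and the feature counts are all assumptions made for demonstration. It appends increasing numbers of Gaussian-distributed irrelevant features to a baseline dataset with well-defined clusters, re-runs $k$-means, and reports ARI, NMI, the Silhouette Coefficient, and the Davies-Bouldin score.

```python
# Illustrative sketch of the sensitivity experiment (assumed configuration,
# not the paper's exact setup): add irrelevant noise features to a dataset
# with well-defined clusters and watch how the metrics respond.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(0)

# Baseline data: 4 well-separated clusters in 5 relevant dimensions.
X, y_true = make_blobs(n_samples=500, centers=4, n_features=5,
                       cluster_std=1.0, random_state=0)

for n_noise in (0, 5, 20, 50):
    # Gaussian-distributed irrelevant features; use rng.uniform(...) instead
    # to probe the uniform case, where the abstract reports tipping points.
    noise = rng.normal(loc=0.0, scale=1.0, size=(X.shape[0], n_noise))
    X_aug = np.hstack([X, noise])

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_aug)

    # External metrics (ARI, NMI) need the ground-truth labels; internal
    # metrics (Silhouette, Davies-Bouldin) are computed from the data alone.
    print(f"{n_noise:3d} irrelevant features | "
          f"ARI={adjusted_rand_score(y_true, labels):.3f} "
          f"NMI={normalized_mutual_info_score(y_true, labels):.3f} "
          f"Silhouette={silhouette_score(X_aug, labels):.3f} "
          f"DB={davies_bouldin_score(X_aug, labels):.3f}")
```

If the abstract's findings hold for this toy setup, ARI and NMI should stay high as Gaussian noise columns accumulate, while the Silhouette Coefficient drops and the Davies-Bouldin score rises at comparably low noise proportions, which is what makes the latter two useful signals for unsupervised feature selection.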
Related papers
- K-means Derived Unsupervised Feature Selection using Improved ADMM [25.145984747164256]
This paper presents a novel method called K-means Derived Unsupervised Feature Selection (K-means UFS)
Unlike most existing spectral analysis based unsupervised feature selection methods, we select features using the objective of K-means.
Experiments on real datasets show that our K-means UFS is more effective than the baselines in selecting features for clustering.
arXiv Detail & Related papers (2024-11-19T18:05:02Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- Learning to Detect Interesting Anomalies [0.0]
AHUNT shows excellent performance on MNIST, CIFAR10, and Galaxy-DESI data.
AHUNT also allows the number of anomaly classes to grow organically in response to the oracle's evaluations.
arXiv Detail & Related papers (2022-10-28T18:00:06Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data [0.28675177318965034]
Unsupervised feature selection aims to reduce the number of features.
We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate.
arXiv Detail & Related papers (2022-05-17T14:17:36Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named as, Compactness Score (CSUFS) to select desired features.
Our proposed algorithm appears more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect [9.666866159867444]
This research work focuses on revisiting complexity metrics based on data morphology.
Being based on ball coverage by classes, they are named after Overlap Number of Balls.
arXiv Detail & Related papers (2020-07-15T18:21:13Z)
- Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.