Cluster Metric Sensitivity to Irrelevant Features
- URL: http://arxiv.org/abs/2402.12008v1
- Date: Mon, 19 Feb 2024 10:02:00 GMT
- Title: Cluster Metric Sensitivity to Irrelevant Features
- Authors: Miles McCrory and Spencer A. Thomas
- Abstract summary: We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways.
Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to added irrelevant features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Clustering algorithms are used extensively in data analysis for data
exploration and discovery. Technological advancements lead to continual growth
of data in terms of volume, dimensionality and complexity. This provides great
opportunities in data analytics as the data can be interrogated for many
different purposes. However, it also leads to challenges, such as the
identification of relevant features for a given task. In supervised tasks, one
can utilise a number of methods to optimise the input features for the task
objective (e.g. classification accuracy). In unsupervised problems, such tools
are not readily available, in part due to an inability to quantify feature
relevance in unlabeled tasks. In this paper, we investigate the sensitivity of
clustering performance to noisy uncorrelated variables iteratively added to
baseline datasets with well-defined clusters. We show how different types of
irrelevant variables can impact the outcome of a clustering result from
$k$-means in different ways. We observe a resilience to very high proportions
of irrelevant features for the adjusted Rand index (ARI) and normalised mutual
information (NMI) when the irrelevant features are Gaussian distributed. For
uniformly distributed irrelevant features, we notice that the resilience of ARI
and NMI is dependent on the dimensionality of the data and exhibits tipping
points between high scores and near zero. Our results show that the Silhouette
Coefficient and the Davies-Bouldin score are the most sensitive to added
irrelevant features, exhibiting large changes in score for comparably low
proportions of irrelevant features regardless of the underlying distribution or
data scaling. As such, the Silhouette Coefficient and the Davies-Bouldin score
are good candidates for optimising feature selection in unsupervised clustering
tasks.
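
As an illustration of the protocol described in the abstract, below is a minimal sketch assuming scikit-learn and a synthetic make_blobs baseline (the paper's actual datasets, scalings and parameter sweeps are not reproduced here): irrelevant Gaussian- or uniform-distributed features are appended in increasing numbers, $k$-means is rerun, and ARI, NMI, the Silhouette Coefficient and the Davies-Bouldin score are recomputed.

```python
# Minimal sketch of the sensitivity experiment; datasets, noise levels and
# k-means settings are illustrative assumptions, not the paper's values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(0)

# Baseline dataset with well-defined clusters.
X_base, y_true = make_blobs(n_samples=600, centers=3, n_features=5,
                            cluster_std=1.0, random_state=0)

def add_irrelevant(X, n_noise, dist):
    """Append n_noise uncorrelated noise features drawn from dist."""
    if dist == "gaussian":
        noise = rng.normal(0.0, 1.0, size=(len(X), n_noise))
    else:  # "uniform"
        noise = rng.uniform(-1.0, 1.0, size=(len(X), n_noise))
    return np.hstack([X, noise])

for dist in ("gaussian", "uniform"):
    for n_noise in (0, 5, 20, 50):  # increasing proportion of irrelevant features
        X = add_irrelevant(X_base, n_noise, dist)
        labels = KMeans(n_clusters=3, n_init=10,
                        random_state=0).fit_predict(X)
        print(f"{dist:8s} d_noise={n_noise:3d} "
              f"ARI={adjusted_rand_score(y_true, labels):.3f} "
              f"NMI={normalized_mutual_info_score(y_true, labels):.3f} "
              f"Sil={silhouette_score(X, labels):.3f} "
              f"DB={davies_bouldin_score(X, labels):.3f}")
```

Inverting the same loop, i.e. greedily keeping only the features that improve the Silhouette Coefficient or the Davies-Bouldin score, yields the kind of unsupervised feature selection the abstract suggests these two metrics are suited to.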
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
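The summary above does not give the scoring details; as a hedged illustration of a forward/backward selection loop, the sketch below uses scikit-learn's SequentialFeatureSelector with a linear model as a stand-in for the paper's transfer-entropy criterion (a hypothetical substitution).

```python
# Hedged sketch of forward/backward feature selection. The estimator-based
# scoring is a stand-in: the paper scores candidates with transfer entropy,
# whose details are not given in the summary above.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, only 3 of which drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=3,
                                    direction=direction).fit(X, y)
    print(direction, sfs.get_support(indices=True))
```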
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- Learning to Detect Interesting Anomalies [0.0]
AHUNT shows excellent performance on MNIST, CIFAR10, and Galaxy-DESI data.
AHUNT also allows the number of anomaly classes to grow organically in response to the Oracle's evaluations.
arXiv Detail & Related papers (2022-10-28T18:00:06Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data [0.28675177318965034]
Unsupervised feature selection aims to reduce the number of features.
We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate.
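As a loose, hypothetical sketch of coalitional feature ranking: Shapley values are computed exactly over all feature coalitions, with the value of a coalition taken here to be the silhouette of a $k$-means clustering restricted to those features; this characteristic function is an illustrative choice, not the paper's formulation (which targets categorical data and redundancy).

```python
# Hypothetical Shapley-value feature ranking; the coalition value function
# is an illustrative stand-in, not the paper's characteristic function.
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def coalition_value(X, S, k=3):
    """Value of a feature coalition: silhouette of k-means on those features."""
    if not S:
        return 0.0
    Xs = X[:, list(S)]
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(Xs)
    return silhouette_score(Xs, labels)

def shapley_feature_ranking(X, k=3):
    d = X.shape[1]  # exact enumeration is exponential: keep d small
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Standard Shapley weight |S|! (d - |S| - 1)! / d!
                w = factorial(r) * factorial(d - r - 1) / factorial(d)
                phi[i] += w * (coalition_value(X, S + (i,), k)
                               - coalition_value(X, S, k))
    return np.argsort(-phi)  # feature indices ranked by contribution
```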
arXiv Detail & Related papers (2022-05-17T14:17:36Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient than existing algorithms.
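The CSUFS formula itself is not given in this summary; the following generic filter-style proxy only illustrates the idea of scoring each feature by how compact the data looks along it (nearest neighbours staying close relative to the feature's overall spread).

```python
# Generic compactness-style filter score (illustrative proxy; not the
# actual CSUFS formula, which is not given in the summary above).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def compactness_proxy(X, n_neighbors=5):
    """Lower score = neighbours stay closer along that feature = more compact."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)  # idx[:, 0] is each point itself
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Mean distance to neighbours along feature j, relative to its spread.
        local = np.abs(col[:, None] - col[idx[:, 1:]]).mean()
        scores.append(local / (col.std() + 1e-12))
    return np.array(scores)
```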
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Improving cluster recovery with feature rescaling factors [2.4366811507669124]
We argue that the rescaling procedure should not treat all features identically.
We introduce a feature rescaling method that takes into account the within-cluster degree of relevance of each feature.
Our comprehensive simulation study, carried out on real and synthetic data, with and without noise features, demonstrates that clustering methods using the proposed data normalization strategy clearly outperform those using traditional data normalization.
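The exact rescaling factors are not given in this summary; a minimal sketch of the general idea, assuming features that vary little within clusters are the relevant ones, is below: an initial $k$-means run estimates per-feature within-cluster dispersion, and each feature is reweighted by its inverse before re-clustering.

```python
# Minimal sketch of within-cluster relevance rescaling (an illustration of
# the idea, not the paper's exact rescaling factors).
import numpy as np
from sklearn.cluster import KMeans

def relevance_rescale(X, n_clusters):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    # Pooled within-cluster variance per feature.
    within = np.zeros(X.shape[1])
    for c in range(n_clusters):
        members = X[labels == c]
        within += members.var(axis=0) * len(members)
    within /= len(X)
    # Relevant features vary little within clusters -> larger weight.
    weights = 1.0 / (within + 1e-12)
    # sqrt so that squared Euclidean distances scale by the weights.
    return X * np.sqrt(weights), weights
```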
arXiv Detail & Related papers (2020-12-01T13:29:35Z)
- Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect [9.666866159867444]
This research work focuses on revisiting complexity metrics based on data morphology.
Being based on ball coverage by classes, they are named Overlap Number of Balls metrics.
arXiv Detail & Related papers (2020-07-15T18:21:13Z)
- Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)