Improving cluster recovery with feature rescaling factors
- URL: http://arxiv.org/abs/2012.00477v1
- Date: Tue, 1 Dec 2020 13:29:35 GMT
- Title: Improving cluster recovery with feature rescaling factors
- Authors: Renato Cordeiro de Amorim and Vladimir Makarenkov
- Abstract summary: We argue that the rescaling procedure should not treat all features identically.
We introduce a feature rescaling method that takes into account the within-cluster degree of relevance of each feature.
Our comprehensive simulation study, carried out on real and synthetic data, with and without noise features, clearly demonstrates that clustering methods that use the proposed data normalization strategy clearly outperform those that use traditional data normalization.
- Score: 2.4366811507669124
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The data preprocessing stage is crucial in clustering. Features may describe
entities using different scales. To rectify this, one usually applies feature
normalisation aiming at rescaling features so that none of them overpowers the
others in the objective function of the selected clustering algorithm. In this
paper, we argue that the rescaling procedure should not treat all features
identically. Instead, it should favour the features that are more meaningful
for clustering. With this in mind, we introduce a feature rescaling method that
takes into account the within-cluster degree of relevance of each feature. Our
comprehensive simulation study, carried out on real and synthetic data, with
and without noise features, clearly demonstrates that clustering methods that
use the proposed data normalization strategy clearly outperform those that use
traditional data normalization.
Related papers
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Clustering (GCC) method to incorporate feature learning and augmentation into clustering procedure.
First, we develop a discrimirative feature alignment mechanism to discover intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z) - Spectral Clustering of Categorical and Mixed-type Data via Extra Graph
Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm.
We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function.
We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
arXiv Detail & Related papers (2024-03-08T20:49:49Z) - Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z) - Nonlinear Feature Aggregation: Two Algorithms driven by Theory [45.3190496371625]
Real-world machine learning applications are characterized by a huge number of features, leading to computational and memory issues.
We propose a dimensionality reduction algorithm (NonLinCFA) which aggregates non-linear transformations of features with a generic aggregation function.
We also test the algorithms on synthetic and real-world datasets, performing regression and classification tasks, showing competitive performances.
arXiv Detail & Related papers (2023-06-19T19:57:33Z) - Algorithm-Agnostic Interpretations for Clustering [0.0]
We propose algorithm-agnostic interpretation methods to explain clustering outcomes in reduced dimensions.
The permutation feature importance for clustering represents a general framework based on shuffling feature values.
All methods can be used with any clustering algorithm able to reassign instances through soft or hard labels.
arXiv Detail & Related papers (2022-09-21T18:08:40Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Learning Debiased and Disentangled Representations for Semantic
Segmentation [52.35766945827972]
We propose a model-agnostic and training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z) - Adaptive Graph-based Generalized Regression Model for Unsupervised
Feature Selection [11.214334712819396]
How to select the uncorrelated and discriminative features is the key problem of unsupervised feature selection.
We present a novel generalized regression model imposed by an uncorrelated constraint and the $ell_2,1$-norm regularization.
It can simultaneously select the uncorrelated and discriminative features as well as reduce the variance of these data points belonging to the same neighborhood.
arXiv Detail & Related papers (2020-12-27T09:07:26Z) - Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Simple and Scalable Sparse k-means Clustering via Feature Ranking [14.839931533868176]
We propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms.
Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings.
arXiv Detail & Related papers (2020-02-20T02:41:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.