An information theoretic approach to quantify the stability of feature
selection and ranking algorithms
- URL: http://arxiv.org/abs/2402.05295v1
- Date: Wed, 7 Feb 2024 22:17:37 GMT
- Title: An information theoretic approach to quantify the stability of feature
selection and ranking algorithms
- Authors: Alaiz-Rodriguez, R., and Parnell, A. C.
- Abstract summary: We propose an information theoretic approach based on the Jensen-Shannon divergence to quantify this robustness.
Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, feature subsets, as well as the lesser-studied partial ranked lists.
We illustrate the use of this stability metric with data generated in a fully controlled way and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature selection is a key step when dealing with high dimensional data. In
particular, these techniques simplify the process of knowledge discovery from
the data by selecting the most relevant features out of the noisy, redundant
and irrelevant features. A problem that arises in many of these practical
applications is that the outcome of the feature selection algorithm is not
stable. Thus, small variations in the data may yield very different feature
rankings. Assessing the stability of these methods becomes an important issue
in these situations. We propose an information theoretic approach based on the
Jensen-Shannon divergence to quantify this robustness.
Unlike other stability measures, this metric is suitable for different
algorithm outcomes: full ranked lists, feature subsets, as well as the
lesser-studied partial ranked lists. This generalized metric quantifies the
difference among a whole set of lists of the same size, following a probabilistic
approach, and it can give more importance to disagreements that appear at the
top of the list. Moreover, it possesses desirable properties including
correction for chance, upper and lower bounds, and conditions for a
deterministic selection. We illustrate the use of this stability metric with
data generated in a fully controlled way and compare it with popular metrics,
including Spearman's rank correlation and Kuncheva's index, on feature
ranking and selection outcomes, respectively. Additionally, experimental
validation of the proposed approach is carried out on a real-world problem of
food quality assessment showing its potential to quantify stability from
different perspectives.
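For intuition, here is a minimal sketch of how such a stability score might be computed over a set of full ranked lists, alongside the two baseline metrics named above. The rank-to-probability mapping (position i weighted by 1/i) and the normalization are illustrative assumptions, not the paper's exact formulation; Kuncheva's index and Spearman's rank correlation follow their standard definitions.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_to_distribution(ranking, n_features):
    # Hypothetical mapping: position i (1 = best) gets weight 1/i, then
    # normalize, so top-of-list disagreements move the distribution more
    # than bottom-of-list ones. The paper's actual mapping may differ.
    p = np.zeros(n_features)
    for position, feature in enumerate(ranking, start=1):
        p[feature] = 1.0 / position
    return p / p.sum()

def js_stability(rankings, n_features):
    # Generalized Jensen-Shannon divergence (uniform weights) among the
    # distributions induced by the rankings, normalized by its log2(m)
    # upper bound and flipped so that 1 means identical rankings.
    dists = np.array([rank_to_distribution(r, n_features) for r in rankings])
    mixture = dists.mean(axis=0)
    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))
    js = np.mean([kl(p, mixture) for p in dists])
    return 1.0 - js / np.log2(len(dists))

def kuncheva_index(a, b, n_features):
    # Kuncheva's consistency index for two equal-size feature subsets A, B:
    # (r - k^2/n) / (k - k^2/n), where r = |A intersect B| and k = |A|.
    k = len(a)
    r = len(set(a) & set(b))
    expected = k * k / n_features
    return (r - expected) / (k - expected)

# Three slightly perturbed rankings of 6 features (feature ids listed
# best-first), mimicking runs of a selector on resampled data:
R = [[0, 1, 2, 3, 4, 5], [0, 2, 1, 3, 4, 5], [1, 0, 2, 3, 4, 5]]
print(js_stability(R, n_features=6))            # close to 1: stable
r0, r1 = np.argsort(R[0]), np.argsort(R[1])     # feature -> rank position
print(spearmanr(r0, r1).correlation)            # pairwise rank correlation
print(kuncheva_index(R[0][:3], R[1][:3], 6))    # top-3 subset overlap
```

Any monotonically decreasing position weighting would preserve the top-weighting behavior the abstract describes; 1/i is simply a convenient choice for this sketch.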
Related papers
- Automatic feature selection and weighting using Differentiable Information Imbalance [41.452380773977154]
We introduce the Differentiable Information Imbalance (DII), an automatic data analysis method to rank information content between sets of features.
Based on the nearest neighbors according to distances in the ground truth feature space, the method finds a low-dimensional subset of the input features.
By employing the Differentiable Information Imbalance as a loss function, the relative feature weights of the inputs are optimized, simultaneously performing unit alignment and relative importance scaling.
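As a rough illustration of the quantity involved, the plain (non-differentiable) information imbalance Delta(A -> B) measures how well nearest-neighbor relations in feature space A predict those in space B. The sketch below is a generic implementation of that definition with made-up toy data; the paper's differentiable variant, which replaces hard ranks with a smooth approximation and optimizes feature weights against it, is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def information_imbalance(xa, xb):
    # Plain information imbalance Delta(A -> B): 2/N times the mean rank,
    # in space B, of each point's nearest neighbor in space A (rank 1 =
    # nearest). ~0 means A predicts B's neighborhoods; ~1 means no info.
    da, db = cdist(xa, xa), cdist(xb, xb)
    n = len(xa)
    np.fill_diagonal(da, np.inf)
    np.fill_diagonal(db, np.inf)
    nn_a = np.argmin(da, axis=1)                      # nearest neighbor in A
    ranks_b = db.argsort(axis=1).argsort(axis=1) + 1  # 1-based ranks in B
    return 2.0 / n * ranks_b[np.arange(n), nn_a].mean()

# Toy usage: B is a noisy copy of A's first coordinate (hypothetical data).
rng = np.random.default_rng(0)
xa = rng.normal(size=(300, 3))
xb = xa[:, :1] + 0.01 * rng.normal(size=(300, 1))
print(information_imbalance(xa, xb))  # lower: A contains B's information
print(information_imbalance(xb, xa))  # higher: B lacks A's other coordinates
```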
arXiv Detail & Related papers (2024-10-30T11:19:10Z)
- Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z)
- Towards stable real-world equation discovery with assessing differentiating quality influence [52.2980614912553]
We propose alternatives to the commonly used finite-difference-based method.
We evaluate these methods in terms of their applicability to problems similar to real-world ones and their ability to ensure the convergence of equation discovery algorithms.
arXiv Detail & Related papers (2023-11-09T23:32:06Z)
- Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
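Transfer entropy itself, the quantity these selection procedures build on, has a standard plug-in estimator for discrete series. The sketch below uses history length 1 and is a generic illustration, not the cited paper's estimator or its forward/backward selection wrappers.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y, base=2):
    # Plug-in estimator of TE_{X->Y} with history length 1 for discrete
    # series: sum over (y', y, x) of p(y', y, x) * log p(y'|y,x) / p(y'|y).
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))
    pairs_yx = Counter(zip(y[:-1], x[:-1]))
    pairs_yy = Counter(zip(y[1:], y[:-1]))
    singles_y = Counter(y[:-1])
    n = len(y) - 1
    te = 0.0
    for (y_next, y_prev, x_prev), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_yx[(y_prev, x_prev)]
        p_cond_self = pairs_yy[(y_next, y_prev)] / singles_y[y_prev]
        te += p_joint * np.log(p_cond_full / p_cond_self) / np.log(base)
    return te

# Toy usage: y copies x with a one-step lag, so information flows X -> Y.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
y = np.roll(x, 1)
print(transfer_entropy(x, y))  # ~1 bit: x drives y
print(transfer_entropy(y, x))  # ~0 bits (small plug-in bias)
```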
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm appears to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- An Evolutionary Correlation-aware Feature Selection Method for Classification Problems [3.2550305883611244]
In this paper, an estimation of distribution algorithm is proposed to meet three goals.
Firstly, as an extension of EDA, the proposed method generates only two individuals in each iteration that compete based on a fitness function.
Secondly, we provide a guiding technique for determining the number of features for individuals in each iteration.
As the main contribution of the paper, in addition to considering the importance of each feature alone, the proposed method can consider the interaction between features.
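To make the two-individual scheme concrete, here is a hedged sketch in the style of a compact-GA/PBIL estimation of distribution algorithm for feature selection: sample two feature masks from per-feature selection probabilities, let them compete on a fitness function, and nudge the probabilities toward the winner. The fitness function and update rule are illustrative placeholders, and the independent Bernoulli model deliberately ignores the feature interactions the cited paper accounts for.

```python
import numpy as np

def eda_feature_selection(fitness, n_features, iters=200, lr=0.05, rng=None):
    # Compact-GA/PBIL-style EDA sketch: two candidate masks per iteration
    # compete on `fitness`; per-feature probabilities move toward the winner.
    rng = rng or np.random.default_rng()
    probs = np.full(n_features, 0.5)
    for _ in range(iters):
        a = rng.random(n_features) < probs
        b = rng.random(n_features) < probs
        winner = a if fitness(a) >= fitness(b) else b
        probs += lr * (winner - probs)      # nudge toward the winner
        probs = np.clip(probs, 0.02, 0.98)  # keep exploration alive
    return probs

# Toy usage with a hypothetical fitness: features 0-2 are informative,
# and larger subsets are penalized.
informative = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
def toy_fitness(mask):
    return informative @ mask - 0.2 * mask.sum()

final = eda_feature_selection(toy_fitness, 8, rng=np.random.default_rng(0))
print(np.round(final, 2))  # high probabilities for features 0-2, low elsewhere
```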
arXiv Detail & Related papers (2021-10-16T20:20:43Z)
- Employing an Adjusted Stability Measure for Multi-Criteria Model Fitting on Data Sets with Similar Features [0.1127980896956825]
We show that our approach achieves the same or better predictive performance compared to the two established approaches.
Our approach succeeds at selecting the relevant features while avoiding irrelevant or redundant features.
For data sets with many similar features, the feature selection stability must be evaluated with an adjusted stability measure.
arXiv Detail & Related papers (2021-06-15T12:48:07Z)
- BayesIMP: Uncertainty Quantification for Causal Data Fusion [52.184885680729224]
We study the causal data fusion problem, where datasets pertaining to multiple causal graphs are combined to estimate the average treatment effect of a target variable.
We introduce a framework which combines ideas from probabilistic integration and kernel mean embeddings to represent interventional distributions in the reproducing kernel Hilbert space.
arXiv Detail & Related papers (2021-06-07T10:14:18Z)
- Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially.
Identifying the most characterizing features that minimize the variance without jeopardizing the bias of our models is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- The best way to select features? [0.0]
Three feature selection algorithms MDA, LIME, and SHAP are compared.
We find LIME to be more stable than MDA, and at least as stable as SHAP for the top-ranked features.
arXiv Detail & Related papers (2020-05-26T02:20:40Z)