Utilizing stability criteria in choosing feature selection methods
yields reproducible results in microbiome data
- URL: http://arxiv.org/abs/2012.00001v1
- Date: Mon, 30 Nov 2020 22:23:26 GMT
- Title: Utilizing stability criteria in choosing feature selection methods
yields reproducible results in microbiome data
- Authors: Lingjing Jiang, Niina Haiminen, Anna-Paola Carrieri, Shi Huang,
Yoshiki Vazquez-Baeza, Laxmi Parida, Ho-Cheol Kim, Austin D. Swafford, Rob
Knight, Loki Natarajan
- Abstract summary: We compare the performance of the popular model prediction metric MSE and the proposed criterion Stability in evaluating four widely used feature selection methods.
We conclude that Stability is preferable to MSE as a feature selection criterion because it better quantifies the reproducibility of the feature selection method.
- Score: 0.9345224141195311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature selection is indispensable in microbiome data analysis, but it can be
particularly challenging because microbiome data sets are high-dimensional,
underdetermined, sparse, and compositional. Great efforts have recently been
made in developing new feature selection methods that handle these data
characteristics, but almost all of them have been evaluated based on the
performance of model predictions. Little attention has been paid to a
fundamental question: how appropriate are those evaluation criteria? Most
feature selection methods control for model fit, but the ability to identify
meaningful subsets of features cannot be judged by prediction accuracy alone.
If tiny changes to the training data lead to large changes in the chosen
feature subset, then many of the biological features an algorithm finds are
likely to be data artifacts rather than real biological signal. This crucial
need to identify relevant and reproducible features motivates reproducibility
criteria such as Stability, which quantifies how robust a method is to
perturbations in the data. In our paper, we compare the popular model
prediction metric MSE and the proposed reproducibility criterion Stability in
evaluating four widely used feature selection methods, in both simulations and
experimental microbiome applications. We conclude that Stability is preferable
to MSE as a feature selection criterion because it better quantifies the
reproducibility of the feature selection method.
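As a rough illustration of the Stability idea, the sketch below estimates stability as the average pairwise Jaccard similarity between feature subsets returned on bootstrap resamples. The paper's exact Stability definition and the four methods it evaluates are not reproduced here; the Lasso selector and the simulated data are placeholders.

```python
# Minimal sketch: feature-selection stability as the mean pairwise Jaccard
# similarity of subsets chosen on bootstrap resamples. The cited paper's
# exact Stability definition may differ; this only illustrates the idea.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def selected_features(X, y):
    """Return the index set of features with nonzero Lasso coefficients."""
    model = LassoCV(cv=5).fit(X, y)
    return set(np.flatnonzero(model.coef_))

def stability(X, y, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
        subsets.append(selected_features(X[idx], y[idx]))
    # Average Jaccard similarity over all pairs of selected subsets.
    sims = [len(a & b) / max(len(a | b), 1)
            for a, b in combinations(subsets, 2)]
    return float(np.mean(sims))

# A value near 1 means the method picks nearly the same features under
# data perturbation; a value near 0 means the choice is unstable.
X = np.random.default_rng(1).normal(size=(100, 50))
y = X[:, 0] - 2 * X[:, 1] + np.random.default_rng(2).normal(size=100)
print(f"Stability: {stability(X, y):.2f}")
```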
Related papers
- Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation [0.18846515534317262]
This study presents a machine learning based classification framework that integrates feature selection with data augmentation techniques. We show that the proposed pipeline yields strong cross-validated performance on small datasets.
arXiv Detail & Related papers (2025-05-06T10:09:50Z) - CritiQ: Mining Data Quality Criteria from Human Preferences [70.35346554179036]
We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality.
CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments.
We demonstrate the effectiveness of our method in the code, math, and logic domains.
arXiv Detail & Related papers (2025-02-26T16:33:41Z) - Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights [2.7516838144367735]
We propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis.
We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics.
Our repeated trials approach consistently identified key features at the subject level and improved group-level feature importance analysis.
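A minimal sketch of such a repeated-trials protocol, assuming a fixed Random Forest refit over many random splits with feature importances aggregated across repetitions; the cited paper's validation details may differ.

```python
# Hedged sketch of a repeated-trials protocol: refit one fixed Random Forest
# over many train/test splits and aggregate feature importances, so a feature
# counts as "key" only if it ranks highly across repetitions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
importances = []
for train_idx, _ in ShuffleSplit(n_splits=30, test_size=0.3,
                                 random_state=0).split(X):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    importances.append(rf.feature_importances_)

importances = np.array(importances)
mean, std = importances.mean(axis=0), importances.std(axis=0)
# Stable key features show high mean importance and low variability.
for i in np.argsort(mean)[::-1][:5]:
    print(f"feature {i}: importance {mean[i]:.3f} +/- {std[i]:.3f}")
```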
arXiv Detail & Related papers (2024-12-16T23:14:26Z) - A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy.
We present around 55 distinct features extracted from industrial images, which are then analyzed using statistical methods.
By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z) - Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z) - A Performance-Driven Benchmark for Feature Selection in Tabular Deep
Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
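The summary does not spell out the input-gradient method, so the following is only a hedged sketch: rank features by the mean absolute gradient of a trained network's loss with respect to each input, a common saliency heuristic loosely analogous to inspecting Lasso coefficients. The paper's actual selector is more elaborate.

```python
# Hedged sketch: score features by the mean absolute input gradient of a
# trained network's loss. The cited paper's method is not reproduced here;
# the toy network and data are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = (2 * X[:, 0] - X[:, 3]).unsqueeze(1) + 0.1 * torch.randn(256, 1)

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(500):                       # short training loop
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()

Xg = X.clone().requires_grad_(True)        # gradients w.r.t. the inputs
loss_fn(net(Xg), y).backward()
scores = Xg.grad.abs().mean(dim=0)         # per-feature saliency score
# Expect the informative features 0 and 3 to rank near the top.
print(torch.argsort(scores, descending=True)[:3])
```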
arXiv Detail & Related papers (2023-11-10T05:26:10Z) - Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
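Transfer entropy itself is a standard quantity; a minimal plug-in estimate for discretized series, on which such forward/backward selection procedures could be built, might look like the sketch below. The paper's estimator and its guarantees are not reproduced.

```python
# Hedged sketch: plug-in transfer entropy TE(X -> Y) for discretized series,
# TE = sum p(y1, y0, x0) * log[ p(y1 | y0, x0) / p(y1 | y0) ].
import numpy as np
from collections import Counter

def transfer_entropy(x, y, bins=4):
    """Plug-in transfer entropy estimate from series x to series y."""
    edges = np.linspace(0, 1, bins + 1)[1:-1]
    xd = np.digitize(x, np.quantile(x, edges))
    yd = np.digitize(y, np.quantile(y, edges))
    triples = Counter(zip(yd[1:], yd[:-1], xd[:-1]))
    pairs_yx = Counter(zip(yd[:-1], xd[:-1]))
    pairs_yy = Counter(zip(yd[1:], yd[:-1]))
    singles = Counter(yd[:-1])
    n = len(yd) - 1
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n
        p_full = c / pairs_yx[(y0, x0)]                 # p(y1 | y0, x0)
        p_marg = pairs_yy[(y1, y0)] / singles[y0]       # p(y1 | y0)
        te += p_joint * np.log(p_full / p_marg)
    return te

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.roll(x, 1) + 0.5 * rng.normal(size=2000)  # y lags x by one step
print(f"TE(x->y) = {transfer_entropy(x, y):.3f}")  # clearly positive
print(f"TE(y->x) = {transfer_entropy(y, x):.3f}")  # near zero (up to plug-in bias)
```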
arXiv Detail & Related papers (2023-10-17T08:04:45Z) - Parallel feature selection based on the trace ratio criterion [4.30274561163157]
This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST).
Our method uses trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness.
The experiments show that our method can produce a small set of features in a fraction of the time taken by the other methods under comparison.
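A small sketch of the trace criterion itself, the ratio of between-class to within-class scatter from Fisher's Discriminant Analysis; PFST's parallel search strategy is not reproduced here.

```python
# Hedged sketch of the trace criterion: tr(S_b) / tr(S_w) over a feature
# subset, where S_b and S_w are between- and within-class scatter matrices.
import numpy as np

def trace_criterion(X, y, features):
    """Class-separability score for a feature subset (higher is better)."""
    Xs = X[:, features]
    mu = Xs.mean(axis=0)
    s_b = s_w = 0.0
    for c in np.unique(y):
        Xc = Xs[y == c]
        mu_c = Xc.mean(axis=0)
        s_b += len(Xc) * np.sum((mu_c - mu) ** 2)  # between-class scatter
        s_w += np.sum((Xc - mu_c) ** 2)            # within-class scatter
    return s_b / s_w

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 0] += 3 * y                       # feature 0 separates the classes
print(trace_criterion(X, y, [0]))      # large
print(trace_criterion(X, y, [1, 2]))   # near zero
```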
arXiv Detail & Related papers (2022-03-03T10:50:33Z) - Loss-guided Stability Selection [0.0]
It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data.
Standard Stability Selection is based on a global criterion, namely the per-family error rate.
We propose a Stability Selection variant which respects the chosen loss function via an additional validation step.
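For context, standard Stability Selection counts how often each feature is chosen across subsamples; a minimal sketch with a Lasso base selector follows. The loss-guided validation step proposed by the paper is not reproduced.

```python
# Hedged sketch of standard Stability Selection: per-feature selection
# frequencies of a Lasso base selector over random half-subsamples.
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.1, n_sub=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-subsample
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_sub

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = X[:, 0] - X[:, 5] + 0.5 * rng.normal(size=200)
freq = selection_frequencies(X, y)
# Keep features selected in at least 80% of subsamples, e.g. {0, 5}.
print(np.flatnonzero(freq >= 0.8))
```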
arXiv Detail & Related papers (2022-02-10T11:20:25Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature
Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named as, Compactness Score (CSUFS) to select desired features.
Our proposed algorithm is shown to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
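A minimal sketch of random oversampling, one of the strategies named above, assuming binary labels; random undersampling is symmetric (drop majority rows instead of duplicating minority ones).

```python
# Hedged sketch: random oversampling of the minority class so that both
# classes end up equally represented.
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until both classes are equally sized."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    X_min, y_min = X[y == minority], y[y == minority]
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=int(counts.max()), random_state=seed)
    return (np.vstack([X[y == majority], X_up]),
            np.concatenate([y[y == majority], y_up]))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)   # 9:1 class imbalance
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))           # [90 90]
```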
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Filter Methods for Feature Selection in Supervised Machine Learning
Applications -- Review and Benchmark [0.0]
This review synthesizes the literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment.
We consider four typical dataset scenarios that are challenging for ML models.
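As a concrete example of a filter method of the kind such benchmarks compare, the sketch below ranks features by univariate mutual information with the target, independent of any downstream model. The review itself works in R, so this Python version is illustrative only.

```python
# Hedged sketch of a typical filter method: rank features by univariate
# mutual information with the target, then keep the top-scoring ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)
scores = mutual_info_classif(X, y, random_state=0)
top10 = np.argsort(scores)[::-1][:10]  # ten highest-scoring features
print(top10)
```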
arXiv Detail & Related papers (2021-11-23T20:20:24Z) - Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially.
Identifying the most informative features, those that minimize the variance without inflating the bias of our models, is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z) - Leveraging Model Inherent Variable Importance for Stable Online Feature
Selection [16.396739487911056]
We introduce FIRES, a novel framework for online feature selection.
Our framework is generic in that it leaves the choice of the underlying model to the user.
Experiments show that the proposed framework is clearly superior in terms of feature selection stability.
arXiv Detail & Related papers (2020-06-18T10:01:18Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.