Analysis of ensemble feature selection for correlated high-dimensional
RNA-Seq cancer data
- URL: http://arxiv.org/abs/2004.13809v1
- Date: Tue, 28 Apr 2020 20:38:53 GMT
- Title: Analysis of ensemble feature selection for correlated high-dimensional
RNA-Seq cancer data
- Authors: Aneta Polewko-Klim, Witold R. Rudnicki
- Abstract summary: This study compares two approaches for the discovery of relevant variables.
The most informative features are identified using four feature selection algorithms.
Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms.
- Score: 0.24366811507669126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discovery of diagnostic and prognostic molecular markers is an important and
actively pursued field of cancer research. For complex diseases,
this process is often performed using Machine Learning. The current study
compares two approaches for the discovery of relevant variables: by application
of a single feature selection algorithm, versus by an ensemble of diverse
algorithms. These approaches are used to identify variables that are relevant
for discerning four cancer types using RNA-seq profiles from the Cancer Genome
Atlas. The comparison is carried out in two directions: evaluating the
predictive performance of models and monitoring the stability of selected
variables. The most informative features are identified using four feature
selection algorithms, namely U-test, ReliefF, and two variants of the MDFS
algorithm. Discerning normal and tumor tissues is performed using the Random
Forest algorithm. The highest stability of the feature set was obtained when
U-test was used. Unfortunately, models built on feature sets obtained from the
ensemble of feature selection algorithms were no better than models
developed on feature sets obtained from individual algorithms. On the other
hand, the feature selectors leading to the best classification results varied
between data sets.
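The ensemble idea described in the abstract can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' pipeline. The U-test score follows the Mann-Whitney U statistic, but `mean_difference_score` is a hypothetical stand-in for ReliefF and MDFS, combining filters by mean rank is an assumed aggregation rule, the mean pairwise Jaccard index is an assumed stability measure, and the toy data generator replaces the real RNA-seq profiles. The Random Forest classification step is omitted.

```python
import random
from itertools import combinations


def u_statistic_score(values, labels):
    """Mann-Whitney U separation score in [0, 1]; 0 means no separation."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    # U counts (tumor, normal) pairs where the tumor value is larger; ties count 0.5.
    u = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auc = u / (len(pos) * len(neg))
    return abs(auc - 0.5) * 2.0


def mean_difference_score(values, labels):
    """Hypothetical second filter (stand-in for ReliefF/MDFS, not from the paper)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = (max(values) - min(values)) or 1.0  # avoid division by zero
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg)) / spread


def ranks(scores):
    """Return the rank of each feature; the highest score gets rank 0."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for r, j in enumerate(order):
        rank[j] = r
    return rank


def select_top_k(X, y, scorers, k):
    """Ensemble selection (assumed rule): sum per-filter ranks, keep the k best."""
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    total = [0] * n_features
    for scorer in scorers:
        for j, r in enumerate(ranks([scorer(col, y) for col in columns])):
            total[j] += r
    return set(sorted(range(n_features), key=lambda j: total[j])[:k])


def jaccard_stability(feature_sets):
    """Mean pairwise Jaccard index of the selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def make_toy_data(n_samples=60, n_features=20, n_informative=3, seed=0):
    """Synthetic data: the first n_informative features are shifted in class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y


def stability_over_resamples(X, y, scorers, k, n_rounds=10, seed=1):
    """Repeat selection on random 80% subsamples and measure set overlap."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    selected = []
    for _ in range(n_rounds):
        sample = rng.sample(indices, int(0.8 * len(indices)))
        selected.append(select_top_k([X[i] for i in sample],
                                     [y[i] for i in sample], scorers, k))
    return jaccard_stability(selected)


if __name__ == "__main__":
    X, y = make_toy_data()
    single = stability_over_resamples(X, y, [u_statistic_score], k=5)
    ensemble = stability_over_resamples(
        X, y, [u_statistic_score, mean_difference_score], k=5)
    print("U-test only stability:", round(single, 3))
    print("ensemble stability:", round(ensemble, 3))
```

Running the script prints one stability value for the single U-test filter and one for the two-filter ensemble, which mirrors the comparison made in the study on a much smaller scale. The numbers on toy data say nothing about the paper's results.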
Related papers
- Optimal Kernel Choice for Score Function-based Causal Discovery [92.65034439889872]
We propose a kernel selection method within the generalized score function that automatically selects the optimal kernel that best fits the data.
We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms kernel selection methods.
arXiv Detail & Related papers (2024-07-14T09:32:20Z)
- Exhaustive Exploitation of Nature-inspired Computation for Cancer Screening in an Ensemble Manner [20.07173196364489]
This study presents a framework termed Evolutionary Optimized Diverse Ensemble Learning (EODE) to improve ensemble learning for cancer classification from gene expression data.
Experiments were conducted across 35 gene expression benchmark datasets encompassing varied cancer types.
arXiv Detail & Related papers (2024-04-06T08:07:48Z)
- Feature Selection as Deep Sequential Generative Learning [50.00973409680637]
We develop a deep variational transformer model over a joint of sequential reconstruction, variational, and performance evaluator losses.
Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores.
arXiv Detail & Related papers (2024-03-06T16:31:56Z)
- Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection [0.18648070031379424]
Biomarker identification with feature selection methods can be addressed as a multi-objective problem with trade-offs between predictive ability and parsimony in the number of features.
We propose DOSA-MO, a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation.
arXiv Detail & Related papers (2023-12-27T16:13:14Z)
- An Application of a Multivariate Estimation of Distribution Algorithm to Cancer Chemotherapy [59.40521061783166]
Chemotherapy treatment for cancer is a complex optimisation problem with a large number of interacting variables and constraints.
Although the more sophisticated algorithm would be expected to yield better performance on a complex problem like this, we show that it does not.
We hypothesise that this is caused by the more sophisticated algorithm being impeded by the large number of interactions in the problem.
arXiv Detail & Related papers (2022-05-17T15:28:46Z)
- Improving RNA Secondary Structure Design using Deep Reinforcement Learning [69.63971634605797]
We propose a new benchmark of applying reinforcement learning to RNA sequence design, in which the objective function is defined to be the free energy in the sequence's secondary structure.
We show results of the ablation analysis that we do for these algorithms, as well as graphs indicating the algorithm's performance across batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z)
- A Study of Feature Selection and Extraction Algorithms for Cancer Subtype Prediction [0.0]
We show that the existing feature selection methods are computationally expensive when applied individually.
We apply these algorithms sequentially which helps in lowering the computational cost and improving the predictive performance.
arXiv Detail & Related papers (2021-09-29T18:11:24Z)
- A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation [71.31905141672529]
We study the widely adopted ancestral sampling algorithms for auto-regressive language models.
We identify three key properties that are shared among them: entropy reduction, order preservation, and slope preservation.
We find that the set of sampling algorithms that satisfies these properties performs on par with the existing sampling algorithms.
arXiv Detail & Related papers (2020-09-15T17:28:42Z)
- A Novel Community Detection Based Genetic Algorithm for Feature Selection [3.8848561367220276]
The authors propose a genetic algorithm based on community detection, which functions in three steps.
Nine benchmark classification problems were analyzed in terms of the performance of the presented approach.
arXiv Detail & Related papers (2020-08-08T15:39:30Z)
- A generalised OMP algorithm for feature selection with application to gene expression data [1.969028842568933]
To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of available features.
We propose gOMP, a highly-scalable generalisation of the Orthogonal Matching Pursuit feature selection algorithm.
arXiv Detail & Related papers (2020-04-01T08:33:02Z)
- Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting.
In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions.
We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.