BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification
- URL: http://arxiv.org/abs/2502.13080v1
- Date: Tue, 18 Feb 2025 17:33:41 GMT
- Title: BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification
- Authors: Bich-Chung Phan, Thanh Ma, Huu-Hoa Nguyen, and Thanh-Nghi Do
- Abstract summary: BOLIMES is a novel feature selection algorithm designed to enhance gene expression classification.
It combines exhaustive feature selection with interpretability-driven refinement, offering a powerful solution for high-dimensional gene expression analysis.
- Score: 0.08738116412366388
- License:
- Abstract: Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily due to the high dimensionality of genomic data and the risk of overfitting. To bridge this gap, we propose BOLIMES, a novel feature selection algorithm designed to enhance gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, we integrate the robustness of Boruta with the interpretability of LIME, ensuring that only the most relevant and influential genes are retained. BOLIMES first employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, thus preserving valuable information. It then uses LIME to rank the remaining genes based on their local importance to the classifier. Finally, an iterative classification evaluation determines the optimal feature subset by selecting the number of genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, our solution effectively balances dimensionality reduction with high classification performance, offering a powerful solution for high-dimensional gene expression analysis.
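The three stages described in the abstract (shadow-feature filtering, local-importance ranking, and an iterative search for the best subset size) can be sketched roughly as follows. This is a minimal, standard-library-only illustration under stated assumptions, not the authors' implementation. The univariate score stands in for the random-forest importances that Boruta normally uses, the perturbation-based sensitivity is a crude stand-in for LIME's weighted local linear surrogate, and the nearest-centroid classifier, the synthetic data, and all function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def make_data(n=60, p=30, informative=(0, 1, 2), shift=3.0, seed=42):
    # Synthetic "expression" matrix (assumption): only `informative` columns differ by class.
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0, 1) + (shift * c if j in informative else 0.0)
          for j in range(p)] for c in y]
    return X, y

def score(col, y):
    # Univariate relevance: |difference of class means| / overall std.
    # Stand-in for the random-forest importance used by real Boruta.
    a = [v for v, c in zip(col, y) if c == 1]
    b = [v for v, c in zip(col, y) if c == 0]
    s = pstdev(col)
    return abs(mean(a) - mean(b)) / s if s > 0 else 0.0

def boruta_like_filter(X, y, trials=20, rng=None):
    # Keep features that beat the best shuffled "shadow" feature in most trials.
    rng = rng or random.Random(0)
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    real = [score(c, y) for c in cols]
    hits = [0] * len(cols)
    for _ in range(trials):
        shadow_max = 0.0
        for c in cols:
            shadow = c[:]
            rng.shuffle(shadow)  # shuffling breaks any link to the labels
            shadow_max = max(shadow_max, score(shadow, y))
        for j, r in enumerate(real):
            hits[j] += r > shadow_max
    return [j for j in range(len(cols)) if hits[j] > trials / 2]

def fit_centroids(X, y, feats):
    # Nearest-centroid classifier (assumption): one mean vector per class.
    return {c: [mean(r[j] for r, lab in zip(X, y) if lab == c) for j in feats]
            for c in set(y)}

def predict(cents, row, feats):
    dist = lambda c: sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c]))
    return min(cents, key=dist)

def local_importance_ranking(X, y, feats, n_perturb=5, rng=None):
    # For each sample, perturb one feature at a time and record how much the
    # classifier's decision margin moves; rank features by the total change.
    rng = rng or random.Random(1)
    cents = fit_centroids(X, y, feats)
    margin = lambda row: sum((row[j] - a) ** 2 - (row[j] - b) ** 2
                             for j, a, b in zip(feats, cents[0], cents[1]))
    totals = {j: 0.0 for j in feats}
    for row in X:
        base = margin(row)
        for j in feats:
            for _ in range(n_perturb):
                z = row[:]
                z[j] = rng.choice(X)[j]  # replace with a value from another sample
                totals[j] += abs(margin(z) - base)
    return sorted(feats, key=lambda j: -totals[j])

def select_best_k(Xtr, ytr, Xva, yva, ranked):
    # Try the top-k ranked features for every k; keep the k with the best accuracy.
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        feats = ranked[:k]
        cents = fit_centroids(Xtr, ytr, feats)
        correct = sum(predict(cents, r, feats) == c for r, c in zip(Xva, yva))
        acc = correct / len(yva)
        if acc > best_acc:  # strict '>' keeps the smallest k on ties
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

X, y = make_data()
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]
kept = boruta_like_filter(Xtr, ytr)
ranked = local_importance_ranking(Xtr, ytr, kept)
selected, acc = select_best_k(Xtr, ytr, Xva, yva, ranked)
print("kept:", kept, "selected:", selected, "validation accuracy:", acc)
```

Filtering and ranking use only the training split here, so the held-out samples influence nothing except the final choice of subset size. In practice one would likely swap in an actual Boruta implementation, a LIME explainer, and cross-validation in place of the single split.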
Related papers
- Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii [7.412214379486083]
We will employ machine learning techniques to analyze the gene sequences of Perccottus glenii.
We constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree.
The dataset used by these classification models was extracted from the National Center for Biotechnology Information database.
arXiv Detail & Related papers (2024-10-11T14:45:47Z) - Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Exhaustive Exploitation of Nature-inspired Computation for Cancer Screening in an Ensemble Manner [20.07173196364489]
This study presents a framework termed Evolutionary Optimized Diverse Ensemble Learning (EODE) to improve ensemble learning for cancer classification from gene expression data.
Experiments were conducted across 35 gene expression benchmark datasets encompassing varied cancer types.
arXiv Detail & Related papers (2024-04-06T08:07:48Z) - Feature Selection as Deep Sequential Generative Learning [50.00973409680637]
We develop a deep variational transformer model over a joint of sequential reconstruction, variational, and performance evaluator losses.
Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores.
arXiv Detail & Related papers (2024-03-06T16:31:56Z) - Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data [1.2891210250935148]
A robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative feature for high dimensional gene expression binary classification with class-imbalance problem.
The performance of the proposed ROWSU method is evaluated on 6 gene expression datasets.
arXiv Detail & Related papers (2024-01-23T11:22:03Z) - A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
arXiv Detail & Related papers (2023-11-10T05:26:10Z) - Multivariate feature ranking of gene expression data [62.997667081978825]
We propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency.
We statistically prove that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance.
arXiv Detail & Related papers (2021-11-03T17:19:53Z) - Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z) - Latent regularization for feature selection using kernel methods in tumor classification [1.9078991171384014]
Feature selection is a useful approach for selecting the key genes that help to classify tumors.
We propose a feature selection method based on Multiple Kernel Learning that results in a reduced subset of genes and a custom kernel.
An improvement of the generalization capacity is obtained and assessed by the tumor classification performance on new unseen test samples.
arXiv Detail & Related papers (2020-04-10T00:46:02Z) - A New Gene Selection Algorithm using Fuzzy-Rough Set Theory for Tumor Classification [0.0]
We present a new technique for gene selection using a discernibility matrix of fuzzy-rough sets.
The proposed technique takes into account the similarity of those instances that have the same and different class labels to improve the gene selection results.
Experimental results demonstrate that this technique provides better efficiency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-26T13:43:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.