Improving statistical learning methods via features selection without replacement sampling and random projection
- URL: http://arxiv.org/abs/2506.00053v1
- Date: Wed, 28 May 2025 22:36:46 GMT
- Title: Improving statistical learning methods via features selection without replacement sampling and random projection
- Authors: Sulaiman khan, Muhammad Ahmad, Fida Ullah, Carlos Aguilar Ibañez, José Eduardo Valdez Rodriguez,
- Abstract summary: Cancer is a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression.<n>High-dimensional microarray datasets pose challenges for classification models due to the "small n, large p" problem.<n>This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
- Score: 0.680740878601496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cancer is fundamentally a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression, leading to uncontrolled cell growth and metastasis. High-dimensional microarray datasets pose challenges for classification models due to the "small n, large p" problem, resulting in overfitting. This study makes three different key contributions: 1) we propose a machine learning-based approach integrating the Feature Selection Without Re-placement (FSWOR) technique and a projection method to improve classification accuracy. 2) We apply the Kendall statistical test to identify the most significant genes from the brain cancer mi-croarray dataset (GSE50161), reducing the feature space from 54,675 to 20,890 genes.3) we apply machine learning models using k-fold cross validation techniques in which our model incorpo-rates ensemble classifiers with LDA projection and Na\"ive Bayes, achieving a test score of 96%, outperforming existing methods by 9.09%. The results demonstrate the effectiveness of our ap-proach in high-dimensional gene expression analysis, improving classification accuracy while mitigating overfitting. This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
Related papers
- Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data [4.2547679858666285]
We propose a machine learning-based method for noise reduction and disease-causing gene feature extraction in gene sequencing DeepSeqDenoise algorithm.<n>We screened 17 key features by feature engineering, and constructed an integrated learning model to predict disease-causing genes with 94.3% accuracy.
arXiv Detail & Related papers (2025-05-26T09:23:09Z) - Survey and Improvement Strategies for Gene Prioritization with Large Language Models [61.24568051916653]
Large language models (LLMs) have performed well in medical exams, but their effectiveness in diagnosing rare genetic diseases has not been assessed.<n>We used multi-agent and Human Phenotype Ontology (HPO) classification to categorized patients based on phenotypes and solvability levels.<n>At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly.
arXiv Detail & Related papers (2025-01-30T23:03:03Z) - An Evolutional Neural Network Framework for Classification of Microarray Data [0.0]
This research aims to apply a hybrid model of Genetic Algorithm and Neural Network to overcome the problem during subset selection of informative genes.
Experimental results show the proposed method suggested high accuracy and minimum number of selected genes in comparison with other machine learning algorithms.
arXiv Detail & Related papers (2024-11-20T13:48:40Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - A Hybrid Machine Learning Model for Classifying Gene Mutations in Cancer using LSTM, BiLSTM, CNN, GRU, and GloVe [0.0]
We introduce a novel hybrid ensemble model that synergistically combines LSTM, BiLSTM, CNN, GRU, and GloVe embeddings for the classification of gene mutations in cancer.
Our approach achieved a training accuracy of 80.6%, precision of 81.6%, recall of 80.6%, and an F1 score of 83.1%, alongside a significantly reduced Mean Squared Error (MSE) of 2.596.
arXiv Detail & Related papers (2023-07-24T21:01:46Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Fuzzy Gene Selection and Cancer Classification Based on Deep Learning
Model [1.3072222152900117]
We developed a new fuzzy gene selection technique (FGS) to identify informative genes to facilitate cancer classification.
With our FGS-enhanced method, the cancer classification model achieved 96.5%,96.2%,96%, and 95.9% for accuracy, precision, recall, and f1-score respectively.
In examining the six datasets that were used, the proposed model demonstrates it's capacity to classify cancer effectively.
arXiv Detail & Related papers (2023-05-04T21:52:57Z) - Machine Learning Methods for Cancer Classification Using Gene Expression
Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - A Deep Embedded Refined Clustering Approach for Breast Cancer
Distinction based on DNA Methylation [0.0]
We propose a deep embedded refined clustering method for breast cancer differentiation based on DNA methylation.
The proposed approach is composed of two main stages. The first stage consists in the dimensionality reduction of the methylation data based on an autoencoder.
The second stage is a clustering algorithm based on the soft-assignment of the latent space provided by the autoencoder.
arXiv Detail & Related papers (2021-02-18T16:46:25Z) - Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.