A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier
- URL: http://arxiv.org/abs/2501.06805v1
- Date: Sun, 12 Jan 2025 13:06:01 GMT
- Title: A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier
- Authors: Tareque Mohmud Chowdhury, Farzana Tabassum, Sabrina Islam, Abu Raihan Mostofa Kamal
- Abstract summary: We develop a novel feature selection framework specifically for transcriptome data. We construct two ensemble ML models based on LR, SVM and XGBoost. With 97.11% accuracy and 0.9996 AUC value, our approach performs better than existing methods to classify 33 types of cancers.
- Score: 0.046873264197900916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter on high-dimensional data with high feature-to-sample ratios, which are critical for classifying cancer samples. This study aims to develop a novel feature selection framework specifically for transcriptome data and to propose two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically based on feature types. We then apply the Boruta feature selection process to each partition, combine the results, and apply Boruta again to the combined result. We repeat the process with different Boruta parameters and prepare the final feature set. Finally, we construct two ensemble ML models based on LR, SVM and XGBoost classifiers, one using max voting and one using probability averaging. We use 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11% accuracy and a 0.9996 AUC value, our approach outperforms existing state-of-the-art methods in classifying 33 types of cancer. A set of 12 cancer types is traditionally challenging to differentiate because of their similarity in tissue of origin. Our method accurately identifies over 90% of samples from these 12 cancer types, which outperforms all known methods presented in the existing literature. Gene set enrichment analysis reveals that the features selected by our framework are enriched in pathways highly related to cancer. This study develops a feature selection framework that selects features highly related to cancer development and leads to identifying different types of cancer samples with higher accuracy.
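The two ensemble rules named in the abstract, max voting and probability averaging, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the class labels, and the probability values are hypothetical, and the three probability tables only stand in for the outputs of trained LR, SVM and XGBoost models.

```python
from collections import Counter


def to_labels(probabilities, classes):
    """Turn one model's per-sample class probabilities into hard labels."""
    return [classes[row.index(max(row))] for row in probabilities]


def max_voting(predictions):
    """Hard voting: each model casts one vote per sample.

    predictions[m][i] is model m's label for sample i. On a tie, the
    label seen first (from the earliest model in the list) is kept.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def average_probability(probabilities, classes):
    """Soft voting: average class probabilities over models, then take the argmax.

    probabilities[m][i][k] is model m's probability for sample i and class k.
    """
    n_models = len(probabilities)
    combined = []
    for per_sample in zip(*probabilities):
        mean = [sum(column) / n_models for column in zip(*per_sample)]
        combined.append(classes[mean.index(max(mean))])
    return combined


# Hypothetical class labels and made-up probabilities for two samples.
classes = ["type_a", "type_b", "type_c"]
lr_proba = [[0.6, 0.3, 0.1], [0.45, 0.55, 0.0]]
svm_proba = [[0.4, 0.5, 0.1], [0.45, 0.55, 0.0]]
xgb_proba = [[0.2, 0.7, 0.1], [0.9, 0.1, 0.0]]

all_proba = [lr_proba, svm_proba, xgb_proba]
hard = max_voting([to_labels(p, classes) for p in all_proba])
soft = average_probability(all_proba, classes)

print("max voting:         ", hard)
print("averaged probability:", soft)
```

In this toy input the two rules disagree on the second sample: two models weakly prefer `type_b`, so max voting picks it, while the third model's confident `type_a` prediction dominates the averaged probabilities. This is the general reason the two ensembles can produce different results from the same base classifiers.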
Related papers
- Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification, An Interpretable Multi-Omics Approach [36.92842246372894]
Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN) is a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples.
By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability.
arXiv Detail & Related papers (2025-03-29T02:14:05Z) - A Comparative Analysis of Image Descriptors for Histopathological Classification of Gastric Cancer [39.69192026190426]
Gastric cancer ranks as the fifth most common and fourth most lethal cancer globally, with a dismal 5-year survival rate of approximately 20%.
This study employs Machine Learning and Deep Learning techniques to classify histological images into healthy and cancerous categories.
arXiv Detail & Related papers (2025-03-21T12:46:22Z) - Biomarker based Cancer Classification using an Ensemble with Pre-trained Models [2.2436844508175224]
We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929.
We also propose a novel ensemble model combining pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464).
arXiv Detail & Related papers (2024-06-14T14:43:59Z) - Adaptive Fusion of Radiomics and Deep Features for Lung Adenocarcinoma Subtype Recognition [17.909368834829156]
The most common type of lung cancer, lung adenocarcinoma (LUAD), has been increasingly detected since the advent of low-dose computed tomography screening technology.
In clinical practice, pre-invasive LUAD (Pre-IAs) should only require regular follow-up care, while invasive LUAD (IAs) should receive immediate treatment with appropriate lung cancer resection, based on the cancer subtype.
arXiv Detail & Related papers (2023-08-27T03:54:55Z) - DEDUCE: Multi-head attention decoupled contrastive learning to discover cancer subtypes based on multi-omics data [7.049723871585993]
We propose a model, named DEDUCE, for unsupervised contrastive learning to analyze multi-omics cancer data.
This model adopts an unsupervised SMAE that can deeply extract contextual features and long-range dependencies from multi-omics data.
Subtypes are clustered by calculating the similarity between samples in both the feature space and sample space of multi-omics data.
arXiv Detail & Related papers (2023-07-09T00:53:23Z) - Improving Precancerous Case Characterization via Transformer-based Ensemble Learning [31.891340667123124]
The application of natural language processing to cancer pathology reports has been focused on detecting cancer cases.
Improving the characterization of precancerous adenomas assists in developing diagnostic tests for early cancer detection and prevention.
Our results demonstrated the potential of using NLP to leverage real-world health record data to facilitate the development of diagnostic tests for early cancer prevention.
arXiv Detail & Related papers (2022-12-10T00:06:28Z) - Gene selection from microarray expression data: A Multi-objective PSO with adaptive K-nearest neighborhood [0.0]
This paper deals with the classification problem of human cancer diseases by using gene expression data.
A new methodology is presented to analyze microarray datasets and efficiently classify cancer diseases.
arXiv Detail & Related papers (2022-05-27T04:22:10Z) - A Comparative Study of Gastric Histopathology Sub-size Image Classification: from Linear Regression to Visual Transformer [25.66209350064889]
Gastric cancer is the fifth most common cancer in the world.
Computer technology has advanced rapidly to assist physicians in the diagnosis of gastric cancer.
arXiv Detail & Related papers (2022-05-25T15:13:08Z) - Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z) - Topological Data Analysis of copy number alterations in cancer [70.85487611525896]
We explore the potential to capture information contained in cancer genomic information using a novel topology-based approach.
We find that this technique has the potential to extract meaningful low-dimensional representations in cancer somatic genetic data.
arXiv Detail & Related papers (2020-11-22T17:31:23Z) - Sickle-cell disease diagnosis support selecting the most appropriate machinelearning method: Towards a general and interpretable approach for cellmorphology analysis from microscopy images [0.0]
We propose an approach to select the classification method and features, based on the state-of-the-art.
We used samples from patients with sickle-cell disease, and the approach can be generalized to other study cases.
arXiv Detail & Related papers (2020-10-09T11:46:38Z) - Harvesting, Detecting, and Characterizing Liver Lesions from Large-scale Multi-phase CT Data via Deep Dynamic Texture Learning [24.633802585888812]
We propose a fully-automated and multi-stage liver tumor characterization framework for dynamic contrast computed tomography (CT).
Our system comprises four sequential processes of tumor proposal detection, tumor harvesting, primary tumor site selection, and deep texture-based tumor characterization.
arXiv Detail & Related papers (2020-06-28T19:55:34Z) - The scalable Birth-Death MCMC Algorithm for Mixed Graphical Model Learning with Application to Genomic Data Integration [0.0]
We propose a novel mixed graphical model approach to analyze multi-omic data of different types.
We find that our method is superior in terms of both computational efficiency and the accuracy of the model selection results.
arXiv Detail & Related papers (2020-05-08T16:34:58Z) - Analysis of ensemble feature selection for correlated high-dimensional RNA-Seq cancer data [0.24366811507669126]
This study compares two approaches for the discovery of relevant variables.
The most informative features are identified using four feature selection algorithms.
Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms.
arXiv Detail & Related papers (2020-04-28T20:38:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.