A Comparative Analysis of Gene Expression Profiling by Statistical and
Machine Learning Approaches
- URL: http://arxiv.org/abs/2402.00926v1
- Date: Thu, 1 Feb 2024 18:17:36 GMT
- Authors: Myriam Bontonou, Anaïs Haget, Maria Boulougouri, Benjamin Audit,
Pierre Borgnat, Jean-Michel Arbona
- Abstract summary: We discuss the biological and the methodological limitations of machine learning models to classify cancer samples.
Gene rankings are obtained from explainability methods adapted to these models.
We observe that the information learned by black-box neural networks is related to the notion of differential expression.
- Score: 1.8954222800767324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine learning models have been proposed to classify phenotypes from
gene expression data. In addition to their good performance, these models can
potentially provide some understanding of phenotypes by extracting explanations
for their decisions. These explanations often take the form of a list of genes
ranked in order of importance for the predictions, the highest-ranked genes
being interpreted as linked to the phenotype. We discuss the biological and the
methodological limitations of such explanations. Experiments are performed on
several datasets gathering cancer and healthy tissue samples from the TCGA,
GTEx and TARGET databases. A collection of machine learning models including
logistic regression, multilayer perceptron, and graph neural network are
trained to classify samples according to their cancer type. Gene rankings are
obtained from explainability methods adapted to these models, and compared to
the ones from classical statistical feature selection methods such as mutual
information, DESeq2, and edgeR. Interestingly, on simple tasks, we observe that
the information learned by black-box neural networks is related to the notion
of differential expression. In all cases, a small set containing the
best-ranked genes is sufficient to achieve a good classification. However,
these genes differ significantly between the methods and similar classification
performance can be achieved with numerous lower ranked genes. In conclusion,
although these methods enable the identification of biomarkers characteristic
of certain pathologies, our results question the completeness of the selected
gene sets and thus of explainability by the identification of the underlying
biological processes.
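The comparison of gene rankings described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the gene names, the expression values, and the importance scores are hypothetical toy data, `mean_shift_score` is a crude stand-in for a differential expression statistic (it is not DESeq2 or edgeR), and the top-k Jaccard index is only one possible way to measure how much two rankings agree; the paper may use a different measure.

```python
from statistics import mean, stdev


def rank_genes(scores):
    """Return gene names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_jaccard(ranking_a, ranking_b, k):
    """Jaccard index between the k best-ranked genes of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)


def mean_shift_score(tumor, healthy):
    """Toy score: absolute difference of class means over the overall spread."""
    spread = stdev(tumor + healthy)
    return abs(mean(tumor) - mean(healthy)) / spread if spread > 0 else 0.0


# Hypothetical expression values: gene -> (tumor samples, healthy samples).
expression = {
    "GENE_A": ([9.0, 8.5, 9.2], [2.0, 2.5, 1.8]),
    "GENE_B": ([5.0, 5.1, 4.9], [5.0, 5.2, 4.8]),
    "GENE_C": ([3.0, 4.5, 2.5], [5.0, 6.0, 4.0]),
    "GENE_D": ([7.0, 1.0, 4.6], [4.5, 3.5, 4.0]),
}

# Ranking from the toy statistical score.
stat_scores = {g: mean_shift_score(t, h) for g, (t, h) in expression.items()}
stat_ranking = rank_genes(stat_scores)

# Hypothetical importance scores standing in for an explainability method.
model_scores = {"GENE_A": 0.61, "GENE_B": 0.02, "GENE_C": 0.07, "GENE_D": 0.30}
model_ranking = rank_genes(model_scores)

# Agreement between the two methods on their two best-ranked genes.
overlap = top_k_jaccard(stat_ranking, model_ranking, k=2)
print(stat_ranking, model_ranking, overlap)
```

In this toy case both rankings agree on the top gene but disagree on the second, so the top-2 overlap is low even though each ranking looks plausible on its own. This is the kind of disagreement between methods that the abstract reports.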
Related papers
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data [22.938437500266847]
We introduce a novel model called Multimodal Similarity Learning Graph Neural Network.
It combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data.
Our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.
arXiv Detail & Related papers (2023-09-29T13:33:53Z)
- Studying Limits of Explainability by Integrated Gradients for Gene Expression Models [3.220287168504093]
We show that ranking features by importance is not enough to robustly identify biomarkers.
As it is difficult to evaluate whether biomarkers reflect relevant causes without known ground truth, we simulate gene expression data by proposing a hierarchical model.
arXiv Detail & Related papers (2023-03-19T19:54:15Z)
- Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
- Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology [57.25098075813054]
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles (UPE).
It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner.
These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging quantitative traits (QTs).
SNPs within the known Alzheimer disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes.
The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning method for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)
- A New Gene Selection Algorithm using Fuzzy-Rough Set Theory for Tumor Classification [0.0]
We present a new technique for gene selection using a discernibility matrix of fuzzy-rough sets.
The proposed technique takes into account the similarity of those instances that have the same and different class labels to improve the gene selection results.
Experimental results demonstrate that this technique provides better efficiency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-26T13:43:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.