Handling highly correlated genes in prediction analysis of genomic
studies
- URL: http://arxiv.org/abs/2007.02455v4
- Date: Fri, 8 Apr 2022 01:04:27 GMT
- Title: Handling highly correlated genes in prediction analysis of genomic
studies
- Authors: Li Xing, Songwan Joun, Kurt Mackay, Mary Lesperance, and Xuekui Zhang
- Abstract summary: High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable when conditions change.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Selecting feature genes to predict phenotypes is one of the
typical tasks in analyzing genomics data. Though many general-purpose
algorithms have been developed for prediction, handling highly correlated genes
in the prediction model is still not well addressed. High correlation among
genes introduces technical problems, such as multi-collinearity issues, leading
to unreliable prediction models. Furthermore, when a causal gene (whose
variants have an actual biological effect on a phenotype) is highly correlated
with other genes, most algorithms select the feature gene from the correlated
group in a purely data-driven manner. Since the correlation structure among
genes could change substantially when conditions change, a prediction model
based on incorrectly selected feature genes is unreliable. Therefore, we aim
to keep the causal biological signal in the prediction process and build a more
robust prediction model.
Method: We propose a grouping algorithm, which treats highly correlated genes
as a group and uses their common pattern to represent the group's biological
signal in feature selection. Our novel grouping algorithm can be integrated
into existing prediction algorithms to enhance their prediction performance.
Our proposed grouping method has two advantages. First, using the gene group's
common patterns makes the prediction more robust and reliable when conditions
change. Second, it reports whole correlated gene groups as discovered
biomarkers for prediction tasks, allowing researchers to conduct follow-up
studies to identify causal genes within the identified groups.
Result: Using real benchmark scRNA-seq datasets with simulated cell
phenotypes, we demonstrate that our method significantly outperforms standard
models in both (1) prediction of cell phenotypes and (2) feature gene
selection.
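The grouping idea described in the Method paragraph can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names `group_correlated_genes` and `group_patterns`, the 0.9 correlation threshold, the use of connected components to form groups, and the choice of the mean of standardized expression as the "common pattern" are all assumptions made for illustration. The sketch also assumes no gene is constant across cells and considers only positive correlations, so that averaging does not cancel opposite-sign signals.

```python
import numpy as np


def group_correlated_genes(X, threshold=0.9):
    """Group columns of X (cells x genes) whose Pearson correlation is at
    least `threshold`; groups are connected components of that relation.
    Illustrative only: threshold and linkage rule are assumptions."""
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    parent = list(range(n_genes))  # union-find forest over gene indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if corr[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for g in range(n_genes):
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def group_patterns(X, groups):
    """Represent each group by the mean of its standardized member genes
    (one hypothetical choice of 'common pattern')."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([Z[:, g].mean(axis=1) for g in groups])


# Toy data: two latent signals shared by several genes, plus one lone gene.
rng = np.random.default_rng(0)
n_cells = 200
s1 = rng.normal(size=n_cells)
s2 = rng.normal(size=n_cells)


def noisy(signal):
    return signal + 0.1 * rng.normal(size=n_cells)


X = np.column_stack(
    [noisy(s1), noisy(s1), noisy(s1), noisy(s2), noisy(s2),
     rng.normal(size=n_cells)]
)

groups = group_correlated_genes(X, threshold=0.9)
features = group_patterns(X, groups)  # one column per gene group
```

The matrix `features` could then replace the raw gene columns as input to any standard prediction model, and a selected column maps back to its whole gene group through `groups`, which mirrors the abstract's point that entire correlated groups are reported as candidate biomarkers.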
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- G2PDiffusion: Genotype-to-Phenotype Prediction with Diffusion Models [108.94237816552024]
This paper introduces G2PDiffusion, the first-of-its-kind diffusion model designed for genotype-to-phenotype generation across multiple species.
We use images to represent morphological phenotypes across species and redefine phenotype prediction as conditional image generation.
arXiv Detail & Related papers (2025-02-07T06:16:31Z)
- Survey and Improvement Strategies for Gene Prioritization with Large Language Models [61.24568051916653]
Large language models (LLMs) have performed well in medical exams, but their effectiveness in diagnosing rare genetic diseases has not been assessed.
We used multi-agent and Human Phenotype Ontology (HPO) classification to categorize patients based on phenotypes and solvability levels.
At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly.
arXiv Detail & Related papers (2025-01-30T23:03:03Z)
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performance.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- A Comparative Analysis of Gene Expression Profiling by Statistical and Machine Learning Approaches [1.8954222800767324]
We discuss the biological and the methodological limitations of machine learning models to classify cancer samples.
Gene rankings are obtained from explainability methods adapted to these models.
We observe that the information learned by black-box neural networks is related to the notion of differential expression.
arXiv Detail & Related papers (2024-02-01T18:17:36Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging QTs.
SNPs within the known Alzheimer disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes.
The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)
- Expectile Neural Networks for Genetic Data Analysis of Complex Diseases [3.0088453915399747]
We develop an expectile neural network (ENN) method for genetic data analyses of complex diseases.
Similar to expectile regression, ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes.
We show that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes.
arXiv Detail & Related papers (2020-10-26T21:07:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.