rfPhen2Gen: A machine learning based association study of brain imaging
phenotypes to genotypes
- URL: http://arxiv.org/abs/2204.00067v1
- Date: Thu, 31 Mar 2022 20:15:22 GMT
- Authors: Muhammad Ammar Malik, Alexander S. Lundervold and Tom Michoel
- Abstract summary: We learned machine learning models to predict SNPs using 56 brain imaging QTs.
SNPs within the known Alzheimer disease (AD) risk gene APOE had lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
- Score: 71.1144397510333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imaging genetic studies aim to find associations between genetic variants and
imaging quantitative traits. Traditional genome-wide association studies (GWAS)
are based on univariate statistical tests, but when multiple traits are
analyzed together they suffer from a multiple-testing problem and from not
taking into account correlations among the traits. An alternative approach to
multi-trait GWAS is to reverse the functional relation between genotypes and
traits, by fitting a multivariate regression model to predict genotypes from
multiple traits simultaneously. However, current reverse genotype prediction
approaches are mostly based on linear models. Here, we evaluated random forest
regression (RFR) as a method to predict SNPs from imaging QTs and identify
biologically relevant associations. We learned machine learning models to
predict 518,484 SNPs using 56 brain imaging QTs. We observed that genotype
regression error is a better indicator of permutation p-value significance than
genotype classification accuracy. SNPs within the known Alzheimer disease (AD)
risk gene APOE had lowest RMSE for lasso and random forest, but not ridge
regression. Moreover, random forests identified additional SNPs that were not
prioritized by the linear models but are known to be associated with
brain-related disorders. Feature selection identified well-known brain regions
associated with AD, like the hippocampus and amygdala, as important predictors
of the most significant SNPs. In summary, our results indicate that non-linear
methods like random forests may offer additional insights into
phenotype-genotype associations compared to traditional linear multivariate
GWAS methods.
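The reverse-prediction idea in the abstract can be illustrated with a minimal sketch: fit a model that predicts one SNP's genotype (coded 0/1/2) from several traits, score it by held-out RMSE and by rounded-genotype accuracy, and compare the RMSE against a permutation null. Everything here is an assumption made for illustration and is not the authors' pipeline: the data are simulated, ridge regression stands in for the random forest and lasso models, and the permutation scheme (shuffling genotypes across samples and refitting) and all function names are hypothetical.

```python
import random
from math import sqrt


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_ridge(X, y, lam=1.0):
    """Ridge regression with an unpenalised intercept; returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    A = [[sum(r[a] * r[b] for r in Xc) + (lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    rhs = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    w = solve(A, rhs)
    return y_mean - sum(wj * mj for wj, mj in zip(w, x_mean)), w


def predict(model, X):
    b0, w = model
    return [b0 + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def rmse(y, y_hat):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def accuracy(y, y_hat):
    """Round predicted dosages to the nearest genotype in {0, 1, 2} and compare."""
    calls = [min(2, max(0, round(v))) for v in y_hat]
    return sum(c == t for c, t in zip(calls, y)) / len(y)


def held_out_rmse(X, y, n_train):
    model = fit_ridge(X[:n_train], y[:n_train])
    return rmse(y[n_train:], predict(model, X[n_train:]))


# Simulated data (assumption): one SNP with minor allele frequency 0.3,
# five traits, of which only the first depends on the genotype.
rng = random.Random(0)
n, n_train, n_traits = 300, 200, 5
y = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
X = [[2.0 * g + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(n_traits - 1)]
     for g in y]

# Observed held-out regression error and classification accuracy.
y_hat = predict(fit_ridge(X[:n_train], y[:n_train]), X[n_train:])
obs_rmse = rmse(y[n_train:], y_hat)
obs_acc = accuracy(y[n_train:], y_hat)

# Permutation null: shuffle genotypes to break the trait link, then refit.
n_perm = 99
null_rmse = []
for _ in range(n_perm):
    y_perm = y[:]
    rng.shuffle(y_perm)
    null_rmse.append(held_out_rmse(X, y_perm, n_train))

# Fraction of permutations whose RMSE is at least as low as the observed one.
p_value = (1 + sum(r <= obs_rmse for r in null_rmse)) / (1 + n_perm)
print(f"RMSE={obs_rmse:.3f} accuracy={obs_acc:.3f} permutation p={p_value:.3f}")
```

In a genome-wide setting this loop would run once per SNP, and SNPs would be ranked by held-out RMSE; swapping `fit_ridge` for a non-linear learner is what the abstract's random forest comparison amounts to.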
Related papers
- Interpreting artificial neural networks to detect genome-wide association signals for complex traits [0.0]
Investigating the genetic architecture of complex diseases is challenging due to the highly polygenic and interactive landscape of genetic and environmental factors.
We trained artificial neural networks for predicting complex traits using both simulated and real genotype/phenotype datasets.
arXiv Detail & Related papers (2024-07-26T15:20:42Z)
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer (BPGT) to improve genetic mutation prediction performance.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Predicting loss-of-function impact of genetic mutations: a machine learning approach [0.0]
This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores.
These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation.
Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
arXiv Detail & Related papers (2024-01-26T19:27:38Z)
- Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology [57.25098075813054]
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles.
It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner.
These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z)
- High-dimensional multi-trait GWAS by reverse prediction of genotypes [3.441021278275805]
Reverse regression is a promising approach to perform multi-trait GWAS in high-dimensional settings.
We analyzed different machine learning methods for reverse regression in multi-trait GWAS.
Model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-eQTL target genes.
arXiv Detail & Related papers (2021-10-29T22:34:35Z)
- Expectile Neural Networks for Genetic Data Analysis of Complex Diseases [3.0088453915399747]
We develop an expectile neural network (ENN) method for genetic data analyses of complex diseases.
Similar to expectile regression, ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes.
We show that the proposed method outperformed an existing expectile regression when there exist complex relationships between genetic variants and disease phenotypes.
arXiv Detail & Related papers (2020-10-26T21:07:40Z)
- Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
- Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.