Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
- URL: http://arxiv.org/abs/2506.19598v2
- Date: Sat, 28 Jun 2025 13:26:08 GMT
- Title: Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
- Authors: Alan N. Amin, Andres Potapczynski, Andrew Gordon Wilson
- Abstract summary: We leverage modern fast linear algebra techniques to develop DeepWAS, a method to train large neural network predictive models to optimize likelihood. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
- Score: 44.253701408005895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
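The bottleneck the abstract describes — likelihood computations requiring solves against a large matrix of correlations between nearby variants — is exactly the setting where matrix-free iterative solvers help. The sketch below (illustrative only, not the DeepWAS implementation; the matrix structure and sizes are assumptions) shows how a conjugate-gradient solve replaces explicit inversion: only matrix-vector products with the correlation matrix are needed, so the matrix is never formed or inverted densely.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Illustrative sketch: likelihood evaluation needs solves against a
# large LD-like matrix K = R + sigma2*I, where R captures correlation
# between nearby variants. Conjugate gradients avoids forming or
# inverting K; only products K @ v are required.
rng = np.random.default_rng(0)
n = 500
B = rng.standard_normal((n, 50)) / np.sqrt(50)  # low-rank correlation factor
sigma2 = 1.0

def matvec(v):
    # K @ v computed without ever materializing K or its inverse
    return B @ (B.T @ v) + sigma2 * v

K = LinearOperator((n, n), matvec=matvec)
b = rng.standard_normal(n)
x, info = cg(K, b)  # info == 0 signals convergence

residual = np.linalg.norm(matvec(x) - b)
print(info, residual)
```

Because `K` here is a low-rank update of the identity, each matrix-vector product costs O(n·k) rather than O(n²), which is the kind of structure fast linear algebra methods exploit.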
Related papers
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage pre-trained large language models and DNA sequence models to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z) - Inferring genotype-phenotype maps using attention models [0.21990652930491852]
Predicting phenotype from genotype is a central challenge in genetics. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Here, we apply attention-based models to quantitative genetics.
arXiv Detail & Related papers (2025-04-14T16:32:17Z) - Efficient Data Selection for Training Genomic Perturbation Models [22.722764359030176]
Gene expression models based on graph neural networks are trained to predict the outcomes of gene perturbations. Active learning methods are often employed to train these models due to the cost of the experiments required to build the training set. We propose graph-based one-shot data selection methods for training gene expression models.
arXiv Detail & Related papers (2025-03-18T12:52:03Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
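The high-dimensional ridge regression setting summarized above rests on a standard identity worth making concrete: when features outnumber samples (d > n), the primal solve on a d x d matrix and the dual ("kernel") solve on an n x n matrix yield the same estimator, so only n x n linear algebra is needed. A minimal sketch (the dimensions and data are illustrative assumptions):

```python
import numpy as np

# Primal ridge: w = (X^T X + lam*I_d)^{-1} X^T y     (d x d solve)
# Dual ridge:   w = X^T (X X^T + lam*I_n)^{-1} y     (n x n solve)
# These agree exactly, which matters when d >> n.
rng = np.random.default_rng(0)
n, d, lam = 50, 200, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
print(np.allclose(w_primal, w_dual))  # True
```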
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs from 56 brain imaging quantitative traits (QTs).
SNPs within the known Alzheimer's disease (AD) risk gene APOE had the lowest RMSE for the lasso and random forest models.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z) - Expectile Neural Networks for Genetic Data Analysis of Complex Diseases [3.0088453915399747]
We develop an expectile neural network (ENN) method for genetic data analyses of complex diseases.
Similar to expectile regression, ENN provides a comprehensive view of relationships between genetic variants and disease phenotypes.
We show that the proposed method outperforms existing expectile regression when complex relationships exist between genetic variants and disease phenotypes.
arXiv Detail & Related papers (2020-10-26T21:07:40Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model, which can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called the Prototypical Network, a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
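The grouping idea described above can be sketched in a few lines. This is a hedged illustration of the general approach, not the paper's exact algorithm: genes whose pairwise correlation exceeds a threshold are greedily merged into a group, and the group's common pattern (here simply the member mean) can then stand in for the individual genes during feature selection. The function name and threshold are assumptions for the example.

```python
import numpy as np

def group_correlated_genes(X, threshold=0.9):
    """X: samples x genes matrix. Greedily groups genes whose absolute
    correlation with a seed gene exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    unassigned = set(range(X.shape[1]))
    groups = []
    while unassigned:
        seed = min(unassigned)
        group = [g for g in sorted(unassigned)
                 if abs(corr[seed, g]) >= threshold]
        unassigned.difference_update(group)
        groups.append(group)
    return groups

rng = np.random.default_rng(1)
base = rng.standard_normal((100, 1))
# three near-duplicate genes plus two independent ones
X = np.hstack([base + 0.05 * rng.standard_normal((100, 3)),
               rng.standard_normal((100, 2))])
groups = group_correlated_genes(X)
print(groups)  # the three correlated genes land in one group
```

Representing each group by its mean (or first principal component) is one natural way to realize the "common pattern" the summary mentions.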
arXiv Detail & Related papers (2020-07-05T22:14:03Z) - A Semi-Supervised Generative Adversarial Network for Prediction of Genetic Disease Outcomes [0.0]
We introduce genetic Generative Adversarial Networks (gGAN) to create large synthetic genetic data sets.
Our goal is to determine the propensity of a new individual to develop the severe form of the illness from their genetic profile alone.
The proposed model is self-aware and capable of determining whether a new genetic profile has enough compatibility with the data on which the network was trained.
arXiv Detail & Related papers (2020-07-02T15:35:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.