Related papers: Predicting loss-of-function impact of genetic mutations: a machine learning approach

Predicting loss-of-function impact of genetic mutations: a machine learning approach

URL: http://arxiv.org/abs/2402.00054v1
Date: Fri, 26 Jan 2024 19:27:38 GMT
Title: Predicting loss-of-function impact of genetic mutations: a machine learning approach
Authors: Arshmeet Kaur and Morteza Sarmadi
Abstract summary: This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores. These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation. Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The innovation of next-generation sequencing (NGS) techniques has significantly reduced the price of genome sequencing, lowering barriers to future medical research; it is now feasible to apply genome sequencing to studies where it would have previously been cost-inefficient. Identifying damaging or pathogenic mutations in vast amounts of complex, high-dimensional genome sequencing data may be of particular interest to researchers. Thus, this paper's aims were to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores (which measure a gene's intolerance to loss-of-function mutations). These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation. Models were built using the univariate feature selection technique f-regression combined with K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Sample Consensus (RANSAC), Decision Trees, Random Forest, and Extreme Gradient Boosting (XGBoost). These models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance. The findings of this study include the training of multiple models with testing set r-squared values of 0.97.

Related papers

Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions.<n>We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way.<n>We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z)
Improving statistical learning methods via features selection without replacement sampling and random projection [0.680740878601496]
Cancer is a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression.<n>High-dimensional microarray datasets pose challenges for classification models due to the "small n, large p" problem.<n>This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
arXiv Detail & Related papers (2025-05-28T22:36:46Z)
GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations.<n>In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data.<n>We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets. We found that outliers/skew in predictor or target variables did not pose a challenge to regression models. We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
Genetic heterogeneity analysis using genetic algorithm and network science [2.6166087473624318]
Genome-wide association studies (GWAS) can identify disease susceptible genetic variables. Genetic variables intertwined with genetic effects often exhibit lower effect-size. This paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCSNet)
arXiv Detail & Related papers (2023-08-12T01:28:26Z)
Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly. The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We learned machine learning models to predict SNPs using 56 brain imaging QTs. SNPs within the known Alzheimer disease (AD) risk gene APOE had lowest RMSE for lasso and random forest. Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
High-dimensional multi-trait GWAS by reverse prediction of genotypes [3.441021278275805]
Reverse regression is a promising approach to perform multi-trait GWAS in high-dimensional settings. We analyzed different machine learning methods for reverse regression in multi-trait GWAS. Model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-eQTL target genes.
arXiv Detail & Related papers (2021-10-29T22:34:35Z)
Deep neural networks with controlled variable selection for the identification of putative causal genetic variants [0.43012765978447565]
We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merit of the proposed method includes: (1) flexible modelling of the non-linear effect of genetic variants to improve statistical power; (2) multiple knockoffs in the input layer to rigorously control false discovery rate; (3) hierarchical layers to substantially reduce the number of weight parameters and activations to improve computational efficiency.
arXiv Detail & Related papers (2021-09-29T20:57:48Z)
Mycorrhiza: Genotype Assignment usingPhylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem. Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples. Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.