Related papers: PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

URL: http://arxiv.org/abs/2111.01969v1
Date: Wed, 3 Nov 2021 01:30:57 GMT
Title: PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism
Authors: Yingying Wu, Shusheng Xu, Shing-Tung Yau, Yi Wu
Abstract summary: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate. Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
Score: 10.468453827172477
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate. Natural selection can generate favorable mutations with improved fitness advantages; however, the identified coronaviruses may be the tip of the iceberg, and potentially more fatal variants of concern (VOCs) may emerge over time. Understanding the patterns of emerging VOCs and forecasting mutations that may lead to gain of function or immune escape is urgently required. Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage. In order to identify complex dependencies between the elements of each input sequence, PhyloTransformer utilizes advanced modeling techniques, including a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) from Performer, and the Masked Language Model (MLM) from Bidirectional Encoder Representations from Transformers (BERT). PhyloTransformer was trained with 1,765,297 genetic sequences retrieved from the Global Initiative for Sharing All Influenza Data (GISAID) database. Firstly, we compared the prediction accuracy of novel mutations and novel combinations using extensive baseline models; we found that PhyloTransformer outperformed every baseline method with statistical significance. Secondly, we examined predictions of mutations in each nucleotide of the receptor binding motif (RBM), and we found our predictions were precise and accurate. Thirdly, we predicted modifications of N-glycosylation sites to identify mutations associated with altered glycosylation that may be favored during viral evolution. We anticipate that PhyloTransformer may guide proactive vaccine design for effective targeting of future SARS-CoV-2 variants.

Related papers

PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction [6.649916450582501]
SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory.<n>This poses persistent challenges to public health and vaccine development.<n>We introduce PETRA, a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees.
arXiv Detail & Related papers (2025-11-06T01:58:23Z)
EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations [16.32431932781823]
Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse)<n> Germline mutation specialization via fine-tuning on ClinVar and HGMD.<n>Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations.
arXiv Detail & Related papers (2025-07-29T11:34:41Z)
Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions.<n>We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way.<n>We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z)
JanusDDG: A Thermodynamics-Compliant Model for Sequence-Based Protein Stability via Two-Fronts Multi-Head Attention [0.0]
understanding how residue variations affect protein stability is crucial for designing functional proteins. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis. We introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture.
arXiv Detail & Related papers (2025-04-04T09:02:32Z)
A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer [53.85265022754878]
We propose a lightweight DDG predictor (Light-DDG) for fast mutation screening. We also release a large-scale dataset containing millions of mutation data for pre-training Light-DDG. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences.
arXiv Detail & Related papers (2025-02-10T09:26:57Z)
G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion [108.94237816552024]
We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA. The model contains three novel components: 1) a MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency.
arXiv Detail & Related papers (2025-02-07T06:16:31Z)
VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution [0.0]
We harnessed the power of Large Language Models to predict the evolution of SARS-CoV-2. We trained a transformer model, called VirusT5, to capture the mutation patterns underlying SARS-CoV-2 evolution.
arXiv Detail & Related papers (2024-12-20T08:46:42Z)
Opponent Shaping for Antibody Development [49.26728828005039]
Anti-viral therapies are typically designed to target only the current strains of a virus. therapy-induced selective pressures act on viruses to drive the emergence of mutated strains, against which initial therapies have reduced efficacy. We build on a computational model of binding between antibodies and viral antigens to implement a genetic simulation of viral evolutionary escape.
arXiv Detail & Related papers (2024-09-16T14:56:27Z)
Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
Predicting loss-of-function impact of genetic mutations: a machine learning approach [0.0]
This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores. These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation. Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
arXiv Detail & Related papers (2024-01-26T19:27:38Z)
Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification [60.49594822215981]
This paper presents a classification model for detecting COVID-19 vaccination related search queries. We propose a novel approach of considering dense features as memory tokens that the model can attend to. We show that this new modeling approach enables a significant improvement to the Vaccine Search Insights (VSI) task.
arXiv Detail & Related papers (2022-12-16T13:57:41Z)
InForecaster: Forecasting Influenza Hemagglutinin Mutations Through the Lens of Anomaly Detection [3.5213888068272197]
anomaly detection (AD) is a well-established field in Machine Learning (ML) We propose to tackle this challenge through anomaly detection (AD) We conduct a large number of experiments on four publicly available datasets.
arXiv Detail & Related papers (2022-10-25T02:08:09Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that has caused over 3 million deaths and infecting over 150 million worldwide as of May 2021. We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process. The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC)
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations [0.0]
Influenza virus sequences were identified as an ideal test case for this deep learning framework. MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids. Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
arXiv Detail & Related papers (2020-08-26T20:20:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.