PhyloTransformer: A Discriminative Model for Mutation Prediction Based
on a Multi-head Self-attention Mechanism
- URL: http://arxiv.org/abs/2111.01969v1
- Date: Wed, 3 Nov 2021 01:30:57 GMT
- Title: PhyloTransformer: A Discriminative Model for Mutation Prediction Based
on a Multi-head Self-attention Mechanism
- Authors: Yingying Wu, Shusheng Xu, Shing-Tung Yau, Yi Wu
- Abstract summary: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate.
Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
- Score: 10.468453827172477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an
ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6%
mortality rate. Natural selection can generate favorable mutations with
improved fitness advantages; however, the identified coronaviruses may be the
tip of the iceberg, and potentially more fatal variants of concern (VOCs) may
emerge over time. Understanding the patterns of emerging VOCs and forecasting
mutations that may lead to gain of function or immune escape is urgently
required. Here we developed PhyloTransformer, a Transformer-based
discriminative model that engages a multi-head self-attention mechanism to
model genetic mutations that may lead to viral reproductive advantage. In order
to identify complex dependencies between the elements of each input sequence,
PhyloTransformer utilizes advanced modeling techniques, including a novel Fast
Attention Via positive Orthogonal Random features approach (FAVOR+) from
Performer, and the Masked Language Model (MLM) from Bidirectional Encoder
Representations from Transformers (BERT). PhyloTransformer was trained with
1,765,297 genetic sequences retrieved from the Global Initiative for Sharing
All Influenza Data (GISAID) database. Firstly, we compared the prediction
accuracy of novel mutations and novel combinations using extensive baseline
models; we found that PhyloTransformer outperformed every baseline method with
statistical significance. Secondly, we examined predictions of mutations in
each nucleotide of the receptor binding motif (RBM), and we found our
predictions were precise and accurate. Thirdly, we predicted modifications of
N-glycosylation sites to identify mutations associated with altered
glycosylation that may be favored during viral evolution. We anticipate that
PhyloTransformer may guide proactive vaccine design for effective targeting of
future SARS-CoV-2 variants.
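The masked-language-model objective mentioned in the abstract can be illustrated with a minimal sketch of BERT-style token masking applied to a nucleotide string. This is not the authors' implementation: the function name `mask_sequence`, the 15% selection rate, and the 80/10/10 replacement split are BERT's published defaults used here as assumptions, and the abstract does not state which settings PhyloTransformer uses.

```python
import random

NUCLEOTIDES = "ACGT"   # assumed vocabulary; real data may include other symbols
MASK = "[MASK]"        # placeholder token, as in BERT


def mask_sequence(seq, mask_prob=0.15, seed=None):
    """Return (tokens, labels) for a BERT-style masked-language-model task.

    labels[i] holds the original nucleotide at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for base in seq:
        if rng.random() < mask_prob:
            # Position selected: the model must recover the original base.
            labels.append(base)
            roll = rng.random()
            if roll < 0.8:
                tokens.append(MASK)                     # 80%: mask token
            elif roll < 0.9:
                tokens.append(rng.choice(NUCLEOTIDES))  # 10%: random base
            else:
                tokens.append(base)                     # 10%: left unchanged
        else:
            # Position not selected: input is untouched, no training target.
            tokens.append(base)
            labels.append(None)
    return tokens, labels


if __name__ == "__main__":
    tokens, labels = mask_sequence("ATGTTTGTTTTTCTTGTTTTA", seed=0)
    print(tokens)
    print(labels)
```

A discriminative model trained on such pairs is scored only at positions where the label is not None, which is what lets it rank candidate substitutions at a given site.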
Related papers
- Opponent Shaping for Antibody Development [49.26728828005039]
Anti-viral therapies are typically designed to target only the current strains of a virus.
Therapy-induced selective pressures act on viruses to drive the emergence of mutated strains, against which initial therapies have reduced efficacy.
We build on a computational model of binding between antibodies and viral antigens to implement a genetic simulation of viral evolutionary escape.
arXiv Detail & Related papers (2024-09-16T14:56:27Z)
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performance.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- Predicting loss-of-function impact of genetic mutations: a machine
learning approach [0.0]
This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores.
These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation.
Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
arXiv Detail & Related papers (2024-01-26T19:27:38Z)
- Dense Feature Memory Augmented Transformers for COVID-19 Vaccination
Search Classification [60.49594822215981]
This paper presents a classification model for detecting COVID-19 vaccination related search queries.
We propose a novel approach of considering dense features as memory tokens that the model can attend to.
We show that this new modeling approach enables a significant improvement to the Vaccine Search Insights (VSI) task.
arXiv Detail & Related papers (2022-12-16T13:57:41Z)
- InForecaster: Forecasting Influenza Hemagglutinin Mutations Through the
Lens of Anomaly Detection [3.5213888068272197]
Anomaly detection (AD) is a well-established field in Machine Learning (ML).
We propose to tackle this challenge through anomaly detection (AD).
We conduct a large number of experiments on four publicly available datasets.
arXiv Detail & Related papers (2022-10-25T02:08:09Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that has caused over 3 million deaths and infected over 150 million people worldwide as of May 2021.
We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving
Protein Populations [0.0]
Influenza virus sequences were identified as an ideal test case for this deep learning framework.
MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids.
Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
arXiv Detail & Related papers (2020-08-26T20:20:30Z)