PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction
- URL: http://arxiv.org/abs/2511.03976v1
- Date: Thu, 06 Nov 2025 01:58:23 GMT
- Title: PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction
- Authors: Xu Zou
- Abstract summary: SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory. This poses persistent challenges to public health and vaccine development. We introduce PETRA, a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees.
- Score: 6.649916450582501
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development. While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct application to noisy viral genomic sequences is limited. In this paper, we introduce PETRA (Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution. With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64% respectively for the best baseline. PETRA also demonstrates its ability to aid in the real-time mutation prediction of major clades such as 24F (XEC) and 25A (LP.8.1). The code is open-sourced at https://github.com/xz-keg/PETra
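As a rough illustration of the weighted recall@1 metric reported above, the following sketch computes it under assumed per-example weights. The mutation names, weight values, and data layout are hypothetical stand-ins, not PETRA's actual pipeline:

```python
# Hedged sketch: one plausible way to compute weighted recall@1 for
# mutation prediction. Weights (e.g., correcting geographical/temporal
# sampling imbalance) and all data below are illustrative assumptions.

def weighted_recall_at_1(predictions, true_mutations, weights):
    """predictions: list of ranked mutation lists (top-1 first);
    true_mutations: list of sets of mutations that actually occurred;
    weights: per-example weights correcting sampling imbalance."""
    hit_weight = 0.0
    total_weight = 0.0
    for ranked, truth, w in zip(predictions, true_mutations, weights):
        total_weight += w
        if ranked and ranked[0] in truth:
            hit_weight += w
    return hit_weight / total_weight if total_weight else 0.0

# Hypothetical examples: spike/ORF1a mutation labels chosen for illustration.
preds = [["S:N501Y"], ["S:E484K"], ["ORF1a:T3255I"]]
truths = [{"S:N501Y", "S:D614G"}, {"S:L452R"}, {"ORF1a:T3255I"}]
weights = [0.5, 1.0, 1.5]
print(weighted_recall_at_1(preds, truths, weights))  # 2.0/3.0 ≈ 0.667
```

Weighting the hit count rather than the raw example count is what lets over-sampled regions or time periods contribute proportionally less to the reported score.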
Related papers
- Evolutionary Profiles for Protein Fitness Prediction [45.945064429964084]
EvoIF fuses sequence-structure representations with evolutionary signals to yield calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training depths.
arXiv Detail & Related papers (2025-10-08T17:46:02Z)
- A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer [53.85265022754878]
We propose a lightweight DDG predictor (Light-DDG) for fast mutation screening. We also release a large-scale dataset containing millions of mutation records for pre-training Light-DDG. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences.
arXiv Detail & Related papers (2025-02-10T09:26:57Z)
- VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution [0.0]
We harnessed the power of Large Language Models to predict the evolution of SARS-CoV-2. We trained a transformer model, called VirusT5, to capture the mutation patterns underlying SARS-CoV-2 evolution.
arXiv Detail & Related papers (2024-12-20T08:46:42Z)
- ADIOS: Antibody Development via Opponent Shaping [49.26728828005039]
Anti-viral therapies are typically designed to target only the current strains of a virus. However, therapy-induced selective pressures drive the emergence of new viral strains. This evolutionary response presents an opportunity: our therapies could both defend against and actively influence viral evolution.
arXiv Detail & Related papers (2024-09-16T14:56:27Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that had caused over 3 million deaths and infected over 150 million people worldwide as of May 2021.
We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
- PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism [10.468453827172477]
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate.
Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
arXiv Detail & Related papers (2021-11-03T01:30:57Z)
- Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, following an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants, and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization [76.57716281104938]
We develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously.
STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations.
We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic.
arXiv Detail & Related papers (2020-12-08T21:21:47Z)
- MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations [0.0]
Influenza virus sequences were identified as an ideal test case for this deep learning framework.
MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids.
Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
arXiv Detail & Related papers (2020-08-26T20:20:30Z)
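The median Levenshtein distance reported for MutaGAN above can be reproduced with the standard dynamic-programming edit distance. A minimal sketch follows; the amino-acid strings shown are short illustrative fragments, not real spike proteins:

```python
# Hedged sketch: textbook dynamic-programming Levenshtein (edit)
# distance over amino-acid strings, the metric used to compare
# generated "child" sequences against real descendants.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each b prefix
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Illustrative fragments differing by two substitutions.
print(levenshtein("MKTIIALSYI", "MKTLIALSYV"))  # 2
```

A median distance of 2.00 thus means that half of the generated child sequences were within two single-residue edits of a true descendant.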
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.