AntigenLM: Structure-Aware DNA Language Modeling for Influenza
- URL: http://arxiv.org/abs/2602.09067v1
- Date: Mon, 09 Feb 2026 08:52:04 GMT
- Title: AntigenLM: Structure-Aware DNA Language Modeling for Influenza
- Authors: Yue Pei, Xuebin Chi, Yu Kang,
- Abstract summary: We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training. It also achieves near-perfect subtype classification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.
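The ablation described in the abstract, disrupting genomic structure by fragmentation or shuffling, is easy to picture in code. The paper's actual preprocessing pipeline is not given in this listing, so the minimal sketch below uses made-up segment names, sequences, and chunk size purely for illustration.

```python
import random

# Hypothetical influenza genome as named functional units (gene segments).
# AntigenLM pretrains on intact, aligned units; these toy sequences only
# illustrate the three corpus views compared in the ablation.
genome = {
    "PB2": "ATGGAAAGAATAAAAGAACTACGG",
    "HA":  "ATGAAGGCAAACCTACTGGTCCTG",
    "NA":  "ATGAATCCAAATCAGAAGATAATA",
}

def intact_corpus(units):
    """Structure-aware view: every functional unit kept whole and in order."""
    return [units[name] for name in units]

def fragmented_corpus(units, chunk=8, seed=0):
    """Ablation 1: split units into fixed-size fragments, destroying
    functional-unit integrity and genomic context."""
    rng = random.Random(seed)
    frags = [seq[i:i + chunk] for seq in units.values()
             for i in range(0, len(seq), chunk)]
    rng.shuffle(frags)
    return frags

def shuffled_corpus(units, seed=0):
    """Ablation 2: keep units whole but randomize their order."""
    rng = random.Random(seed)
    order = list(units)
    rng.shuffle(order)
    return [units[name] for name in order]

print(intact_corpus(genome))
print(fragmented_corpus(genome))
print(shuffled_corpus(genome))
```

Per the abstract, pretraining on the fragmented or shuffled views severely degrades downstream performance relative to the intact view, which is the evidence behind the structure-aware design.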
Related papers
- Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension
We investigate how architectural choices in protein language models (PLMs) influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information.
arXiv Detail & Related papers (2025-12-10T18:22:51Z)
- Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences
We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations. We find that most fail to consistently detect strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (non-mutant) sequence decreases.
arXiv Detail & Related papers (2025-06-12T01:28:04Z)
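The Nullsettes entry above boils down to a likelihood comparison: a model detects a loss-of-function (LOF) mutation if it assigns the mutant a lower likelihood than the intact sequence. A minimal sketch of that test, with a stand-in scoring function where the benchmark would call a real genomic language model; the sequences and the toy 2-mer scorer are illustrative only.

```python
def toy_log_likelihood(seq):
    """Stand-in for a genomic LM's log-likelihood: a fake 2-mer model that
    mildly prefers GC/CG dinucleotides. Real use would query the model."""
    table = {"GC": -0.5, "CG": -0.5}
    return sum(table.get(seq[i:i + 2], -1.5) for i in range(len(seq) - 1))

def detects_lof(original, mutant, score=toy_log_likelihood):
    """The benchmark's pass condition (simplified): the mutant must score
    below the original, non-mutant sequence."""
    return score(mutant) < score(original)

wild_type = "ATGGCGCGCATTGCGCTAA"
lof_mutant = "ATGGAAAGCATTAAATTAA"  # illustrative disruptive edit
print(detects_lof(wild_type, lof_mutant))
```

The paper's second finding maps onto this directly: when the model already assigns the original sequence a low likelihood, the margin this comparison relies on shrinks, and detection accuracy drops sharply.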
- UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials
We present UniGenX, a unified generative model for function in natural systems. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens. It achieves state-of-the-art or competitive performance for function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
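The one concrete mechanism the UniGenX summary names is representing heterogeneous inputs as a single mixed stream of symbolic and numeric tokens. The sketch below shows that idea only; the marker tokens and layout are hypothetical, since the model's actual tokenizer is not described in this listing.

```python
from typing import List, Union

Token = Union[str, float]

def mixed_stream(sequence: str, coords: List[float]) -> List[Token]:
    """Interleave symbolic tokens (residues, hypothetical markers) with
    numeric tokens (raw continuous values) in one input stream."""
    stream: List[Token] = ["<seq>"]
    stream.extend(sequence)          # symbolic: one token per residue
    stream.append("<coord>")
    stream.extend(coords)            # numeric: kept as floats, not text
    stream.append("<eos>")
    return stream

print(mixed_stream("MKV", [0.12, -1.3, 2.7]))
```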
- A Phylogenetic Approach to Genomic Language Modeling
We introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees. Our approach integrates an alignment into the loss function during training but does not require it for making predictions. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone.
arXiv Detail & Related papers (2025-03-04T06:53:03Z)
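PhyloGPN's distinctive choice is to use the alignment only inside the training loss: the model predicts from a single sequence, but that prediction is scored against evidence pooled across aligned species, and no alignment is needed at test time. The sketch below substitutes plain column frequencies for the paper's phylogenetic substitution model on the tree, so it shows the shape of the objective rather than the actual method.

```python
import math
from collections import Counter

NUCS = "ACGT"

def column_target(column):
    """Soft target from one alignment column: empirical nucleotide
    frequencies across species (a simplification of modeling evolution
    on the phylogenetic tree). Gaps and ambiguity codes are ignored."""
    counts = Counter(c for c in column if c in NUCS)
    total = sum(counts.values()) or 1
    return [counts[n] / total for n in NUCS]

def alignment_cross_entropy(model_probs, column, eps=1e-9):
    """Training-time loss: the single-sequence prediction is scored
    against the alignment-derived target distribution."""
    target = column_target(column)
    return -sum(t * math.log(p + eps) for t, p in zip(target, model_probs))

pred = [0.7, 0.1, 0.1, 0.1]   # model's P(A), P(C), P(G), P(T) at one site
column = "AAAG-A"             # the same site across aligned species
print(alignment_cross_entropy(pred, column))
```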
- GENERator: A Long-Context Generative Genomic Foundation Model
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
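Prompt-responsive generation, as in the enhancer-design use case above, is ordinary autoregressive sampling conditioned on a prompt. A minimal sketch, with a toy conditional distribution standing in for the 1.2B-parameter model:

```python
import random

def sample_sequence(prompt, next_base_probs, length=40, seed=0):
    """Autoregressively extend a prompt one base at a time, sampling from
    the model's conditional distribution over A/C/G/T."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(length):
        probs = next_base_probs(seq)
        seq += rng.choices("ACGT", weights=probs, k=1)[0]
    return seq

def toy_model(context):
    """Stand-in conditional that mildly favors repeating the last base."""
    last = context[-1]
    return [3.0 if base == last else 1.0 for base in "ACGT"]

print(sample_sequence("TATAAT", toy_model))  # hypothetical promoter prompt
```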
- S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19.
Recent advancements in biomedical language models have shown great potential to interpret complex biological structures and functions.
This paper proposes the Sequence-Structure multi-level pre-trained antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model.
arXiv Detail & Related papers (2024-11-20T14:24:26Z)
- Diffusion Language Models Are Versatile Protein Learners
This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
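DPLM's pretraining objective is a generative self-supervised discrete diffusion framework. A common instantiation is absorbing-state diffusion, where the forward process progressively replaces tokens with a mask symbol and the network learns to reverse it; the sketch below assumes that standard picture and a simple independent-masking schedule, which may differ from DPLM's exact formulation.

```python
import random

MASK = "<mask>"

def forward_mask(tokens, t, seed=0):
    """Forward corruption at noise level t in [0, 1]: each token is
    independently replaced by <mask> with probability t."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t else tok for tok in tokens]

protein = list("MKTAYIAKQR")
for t in (0.25, 0.5, 0.9):
    corrupted = forward_mask(protein, t, seed=1)
    print(t, "".join("_" if x == MASK else x for x in corrupted))

# Training: the network sees the corrupted sequence and is trained to
# recover the original tokens; unconditional generation then runs the
# learned reverse process starting from an all-<mask> sequence.
```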
- xTrimoABFold: De novo Antibody Structure Prediction without MSA
We develop a novel model named xTrimoABFold to predict antibody structure from antibody sequence.
The model was trained end-to-end on antibody structures in the PDB by minimizing an ensemble loss that combines a domain-specific focal loss on the CDR with the frame-aligned point loss.
arXiv Detail & Related papers (2022-11-30T09:26:08Z)
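The training objective named in this entry, an ensemble of a domain-specific focal loss on the CDR and the frame-aligned point loss, can be sketched in simplified form. The "FAPE-like" term below is just a clamped point distance rather than the full per-residue-frame computation, and the equal weighting of the two terms is an assumption.

```python
import math

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for one CDR residue: down-weights examples the model
    already classifies confidently."""
    return -((1.0 - p_correct) ** gamma) * math.log(max(p_correct, 1e-9))

def fape_like(pred_pts, true_pts, clamp=10.0):
    """Heavily simplified frame-aligned point error: mean clamped distance
    between predicted and true atom positions (the real FAPE measures the
    error within every residue's local frame)."""
    dists = [min(math.dist(p, t), clamp) for p, t in zip(pred_pts, true_pts)]
    return sum(dists) / len(dists)

def ensemble_loss(cdr_probs, pred_pts, true_pts, weight=1.0):
    """Illustrative combination of the two terms; the actual weighting in
    xTrimoABFold is not given in this listing."""
    cls = sum(focal_loss(p) for p in cdr_probs) / len(cdr_probs)
    return cls + weight * fape_like(pred_pts, true_pts)

print(ensemble_loss([0.9, 0.6],
                    [(0, 0, 0), (1, 1, 1)],
                    [(0, 0, 1), (1, 2, 1)]))
```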
- Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design
Deep learning-based computational antibody design has attracted growing attention because it automatically mines antibody patterns from data, complementing human expertise.
These computational methods rely heavily on high-quality antibody structure data, which is quite limited.
Fortunately, a large amount of antibody sequence data is available that can help model the CDR and reduce the reliance on structure data.
arXiv Detail & Related papers (2022-10-26T15:31:36Z)
- Reprogramming Pretrained Language Models for Antibody Sequence Infilling
Computational design of antibodies involves generating novel and diverse sequences while maintaining structural consistency.
Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes a model pretrained on a source language for tasks in a different language with scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z)
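Model Reprogramming keeps the pretrained source model frozen and trains only an input transformation that maps the target vocabulary (here, antibody residues) into the source model's input space. A framework-free sketch of that idea; the class name, dimensions, initialization, and stand-in model are all hypothetical.

```python
import random

class ReprogrammingLayer:
    """The only trainable component under Model Reprogramming: a learned
    embedding of target-domain tokens into the source model's input space."""

    def __init__(self, target_vocab, source_dim, seed=0):
        rng = random.Random(seed)
        self.embed = {tok: [rng.gauss(0.0, 0.02) for _ in range(source_dim)]
                      for tok in target_vocab}

    def __call__(self, sequence):
        return [self.embed[tok] for tok in sequence]

def frozen_source_model(vectors):
    """Stand-in for the frozen pretrained LM; its weights never update,
    so scarce antibody data only has to fit the small input mapping."""
    return [sum(v) for v in vectors]  # placeholder computation

layer = ReprogrammingLayer("ACDEFGHIKLMNPQRSTVWY", source_dim=4)
print(frozen_source_model(layer("ACDK")))
```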
This list is automatically generated from the titles and abstracts of the papers on this site.