Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension
- URL: http://arxiv.org/abs/2512.09894v1
- Date: Wed, 10 Dec 2025 18:22:51 GMT
- Title: Exploring Protein Language Model Architecture-Induced Biases for Antibody Comprehension
- Authors: Mengren Liu, Yixiang Zhang, Yiming Zhang
- Abstract summary: We investigate how architectural choices in protein language models (PLMs) influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information.
- Score: 24.38887522188594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs (AntiBERTa, BioBERT, and ESM2) against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models like AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), while general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
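To make the evaluation setup concrete, below is a minimal sketch of the kind of pipeline the abstract describes: extracting fixed-size embeddings from a pre-trained PLM for target specificity classification, then inspecting how much attention flows into a CDR span. The checkpoint name, toy sequences, labels, and CDR indices are illustrative assumptions, not the paper's actual data or protocol.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Small public ESM2 checkpoint (assumption; the paper's exact setup may differ).
MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the final hidden states over residues, excluding special tokens."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, 1:-1].mean(dim=0)

# Toy heavy-chain fragments with made-up binary specificity labels.
sequences = ["EVQLVESGGGLVQPGGSLRLSCAAS", "QVQLQQSGAELARPGASVKMSCKAS"]
labels = [0, 1]

X = torch.stack([embed(s) for s in sequences]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("predictions:", clf.predict(X))

# Attention attribution sketch: total attention mass flowing into a CDR span.
# Real CDR boundaries come from a numbering scheme (IMGT, Kabat, etc.);
# the slice here is a hypothetical placeholder.
inputs = tokenizer(sequences[0], return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions
avg = torch.stack(attentions).mean(dim=(0, 2))  # average over layers and heads
cdr = slice(9, 14)
print("attention into CDR span:", avg[0, :, cdr].sum().item())
```

In this framing, AntiBERTa, BioBERT, or GPT-2 could be swapped in via their own checkpoints to compare architectures on identical downstream features, mirroring the comparison the paper performs.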
Related papers
- AntigenLM: Structure-Aware DNA Language Modeling for Influenza [5.938702748853349]
We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training. It also achieves near-perfect subtype classification.
arXiv Detail & Related papers (2026-02-09T08:52:04Z)
- Machine learning approaches for interpretable antibody property prediction using structural data [1.406995367117218]
Understanding the relationship between antibody sequence, structure, and function is essential for the design of antibody-based therapeutics and research tools. Machine learning models, mostly based on applying large language models to sequence information, have been developed to predict antibody properties. This chapter describes two ML frameworks that integrate structural data (via graph representations) with neural networks to predict properties of antibodies.
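As a rough illustration of the graph representation mentioned above (residues as nodes, spatial proximity as edges), the sketch below builds a C-alpha contact graph that a downstream neural network could consume. The random coordinates and the 8 Å cutoff are placeholders, not the chapter's actual featurization.

```python
import networkx as nx
import numpy as np

def contact_graph(ca_coords: np.ndarray, cutoff: float = 8.0) -> nx.Graph:
    """Build a residue contact graph from C-alpha coordinates of shape (N, 3)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(ca_coords)))
    # Pairwise Euclidean distances between residues.
    dists = np.linalg.norm(ca_coords[:, None] - ca_coords[None, :], axis=-1)
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if dists[i, j] < cutoff:
                g.add_edge(i, j, distance=float(dists[i, j]))
    return g

coords = np.random.rand(25, 3) * 20.0  # placeholder coordinates in angstroms
g = contact_graph(coords)
print(g.number_of_nodes(), "residues,", g.number_of_edges(), "contacts")
```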
arXiv Detail & Related papers (2025-10-28T01:13:09Z)
- PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [88.98041407783502]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
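A minimal sketch of what "graph-level" evaluation means in contrast to pairwise accuracy: assemble predicted pairs into a network and compare its edges and topology against the reference network. The toy edge lists below are placeholders, and PRING's actual metrics are richer than this.

```python
import networkx as nx

# Reference and predicted PPI networks (toy placeholders).
reference = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")])
predicted = nx.Graph([("A", "B"), ("B", "C"), ("A", "C")])

# Normalize undirected edges before comparing the two edge sets.
ref_edges = {frozenset(e) for e in reference.edges()}
pred_edges = {frozenset(e) for e in predicted.edges()}
tp = len(ref_edges & pred_edges)
print(f"edge precision={tp / len(pred_edges):.2f}, recall={tp / len(ref_edges):.2f}")

# A simple topology-level comparison: the degree sequences of the two graphs.
print("reference degrees:", sorted(d for _, d in reference.degree()))
print("predicted degrees:", sorted(d for _, d in predicted.degree()))
```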
arXiv Detail & Related papers (2025-07-07T15:21:05Z)
- DISPROTBENCH: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts [76.59606029593085]
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions. DisProtBench spans three key axes: data complexity, task diversity, and interpretability. Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z)
- AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design [8.195812610020203]
AbBiBench is a benchmarking framework for antibody binding affinity maturation and design. It evaluates an antibody design's binding potential by measuring how well a protein model scores the full Ab-Ag complex.
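The scoring idea can be sketched as follows: feed the full antibody-antigen complex through a masked protein language model and use the summed log-probability of the observed residues as a binding-potential proxy. The checkpoint, the naive chain concatenation, and the unmasked scoring shortcut are assumptions; AbBiBench's actual protocol may mask positions properly and handle chain breaks differently.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def complex_score(antibody: str, antigen: str) -> float:
    """Cheap proxy score: log-probability of each observed residue, no masking."""
    ids = tokenizer(antibody + antigen, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp[0].gather(1, ids[0].unsqueeze(1)).squeeze(1)
    return token_logp[1:-1].sum().item()  # skip CLS/EOS special tokens

# Placeholder fragments, not real binding partners.
print(complex_score("EVQLVESGGGLVQPGGSLRLSCAAS", "NLCPFGEVFNATRFASVYAWNRKRI"))
```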
arXiv Detail & Related papers (2025-05-23T21:09:04Z)
- Relation-Aware Equivariant Graph Networks for Epitope-Unknown Antibody Design and Specificity Optimization [61.06622479173572]
We propose a novel Relation-Aware Design (RAAD) framework, which models antigen-antibody interactions for co-designing sequences and structures of antigen-specific CDRs. Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies.
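The summary does not spell out the constraint's form, but a common shape for such specificity-enhancing objectives is a contrastive loss that pulls each antibody embedding toward its cognate antigen and away from the other antigens in a batch. The sketch below shows that general pattern under this assumption; it is not RAAD's actual formulation.

```python
import torch
import torch.nn.functional as F

def specificity_contrastive_loss(ab_emb: torch.Tensor,
                                 ag_emb: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: each antibody is pulled toward its cognate antigen
    (the diagonal) and pushed away from the other antigens in the batch."""
    ab = F.normalize(ab_emb, dim=-1)
    ag = F.normalize(ag_emb, dim=-1)
    logits = ab @ ag.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(ab.size(0))      # matched pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch of 4 antibody/antigen embedding pairs (placeholders).
ab_emb, ag_emb = torch.randn(4, 128), torch.randn(4, 128)
print(specificity_contrastive_loss(ab_emb, ag_emb).item())
```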
arXiv Detail & Related papers (2024-12-14T03:00:44Z)
- S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning [8.059724314850799]
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19.
Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions.
This paper proposes the Sequence-Structure multi-level pre-trained antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model.
arXiv Detail & Related papers (2024-11-20T14:24:26Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously. xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- xTrimoABFold: De novo Antibody Structure Prediction without MSA [77.47606749555686]
We develop a novel model named xTrimoABFold to predict antibody structure from antibody sequence.
The model was trained end-to-end on the antibody structures in the PDB by minimizing an ensemble loss combining a domain-specific focal loss on the CDRs and the frame-aligned point loss.
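A minimal sketch of such an ensemble objective: a focal loss restricted to CDR positions plus a coordinate-level term. The clamped L2 distance below is a simplified stand-in for the frame-aligned point error (FAPE) loss, which in AlphaFold2-style models is computed in local residue frames; the shapes and toy batch are placeholders.

```python
import torch
import torch.nn.functional as F

def cdr_focal_loss(logits, targets, cdr_mask, gamma: float = 2.0):
    """Focal loss computed only over CDR positions (mask of 0/1 per residue)."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                       # model's probability of the truth
    focal = (1.0 - pt) ** gamma * ce          # down-weight easy positions
    return (focal * cdr_mask).sum() / cdr_mask.sum().clamp(min=1)

def clamped_point_loss(pred_xyz, true_xyz, clamp: float = 10.0):
    """Simplified stand-in for FAPE: clamped per-residue L2 distance."""
    d = torch.linalg.norm(pred_xyz - true_xyz, dim=-1)
    return d.clamp(max=clamp).mean()

# Toy batch: 10 residues, 21 classes, 3D coordinates (all placeholders).
logits = torch.randn(10, 21)
targets = torch.randint(0, 21, (10,))
cdr_mask = torch.zeros(10)
cdr_mask[3:7] = 1.0
pred_xyz, true_xyz = torch.randn(10, 3), torch.randn(10, 3)

loss = cdr_focal_loss(logits, targets, cdr_mask) + clamped_point_loss(pred_xyz, true_xyz)
print(loss.item())
```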
arXiv Detail & Related papers (2022-11-30T09:26:08Z)
- Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design [134.65287929316673]
Deep learning-based computational antibody design has attracted widespread attention, since it automatically mines antibody patterns from data that can complement human expertise.
The computational methods heavily rely on high-quality antibody structure data, which is quite limited.
Fortunately, there exists a large amount of antibody sequence data that can help model the CDRs and alleviate the reliance on structure data.
arXiv Detail & Related papers (2022-10-26T15:31:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.