CoPeP: Benchmarking Continual Pretraining for Protein Language Models
- URL: http://arxiv.org/abs/2603.00253v2
- Date: Tue, 03 Mar 2026 17:50:01 GMT
- Title: CoPeP: Benchmarking Continual Pretraining for Protein Language Models
- Authors: Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
- Abstract summary: We introduce the Continual Pretraining of Protein Language Models benchmark. We define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods.
- Score: 16.835651059100595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta-information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real-world application.
Related papers
- Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein Interaction [39.13469810619366]
We propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering. We apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios.
arXiv Detail & Related papers (2025-07-28T15:31:15Z)
- Test-time Offline Reinforcement Learning on Goal-related Experience [50.94457794664909]
Research in foundation models has shown that performance can be substantially improved through test-time training. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out.
arXiv Detail & Related papers (2025-07-24T21:11:39Z)
- Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing [21.01399785232482]
We propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90% based on sampled training over various data distributions.
arXiv Detail & Related papers (2025-06-16T13:44:25Z)
- Rethinking Text-based Protein Understanding: Retrieval or LLM? [35.322164434180365]
Protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment. We propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios.
arXiv Detail & Related papers (2025-05-26T06:25:43Z)
- Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z)
- Continual Learning with Pre-Trained Models: A Survey [61.97613090666247]
Continual Learning aims to overcome the catastrophic forgetting of former knowledge when learning new ones.
This paper presents a comprehensive survey of the latest advancements in PTM-based CL.
arXiv Detail & Related papers (2024-01-29T18:27:52Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- LifeLonger: A Benchmark for Continual Disease Classification [59.13735398630546]
We introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection.
Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch.
Cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge.
arXiv Detail & Related papers (2022-04-12T12:25:05Z)
- Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora [31.136334214818305]
We study a lifelong language model pretraining challenge where a PTLM is continually updated so as to adapt to emerging data.
Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms.
Our experiments show continual learning algorithms improve knowledge preservation, with logit distillation being the most effective approach.
arXiv Detail & Related papers (2021-10-16T09:59:33Z)