Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
- URL: http://arxiv.org/abs/2512.02315v1
- Date: Tue, 02 Dec 2025 01:20:40 GMT
- Title: Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training
- Authors: Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang, Pascal Notin, Debora S. Marks
- Abstract summary: PRIMO is a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins. PRIMO learns to prioritize promising variants through a preference-based loss function. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks.
- Score: 16.300177286668475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties, including both substitution and indel mutations, PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
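The abstract names a preference-based loss but does not give its form. A minimal sketch of one common choice, a Bradley-Terry-style pairwise ranking objective over the sparse labeled variants (the function, shapes, and toy data below are illustrative assumptions, not PRIMO's actual implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(scores: torch.Tensor, fitness: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise objective: for each variant pair (i, j)
    where i has higher measured fitness than j, penalize the model unless
    it scores i above j. Hypothetical stand-in for PRIMO's preference loss."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)      # diff[i, j] = s_i - s_j
    prefer = fitness.unsqueeze(1) > fitness.unsqueeze(0)  # True where i beats j
    if not prefer.any():
        return scores.new_zeros(())
    return F.softplus(-diff[prefer]).mean()               # -log sigmoid(s_i - s_j)

# Toy usage: five variants, model scores vs. experimental labels
scores = torch.randn(5, requires_grad=True)
fitness = torch.tensor([0.1, 0.9, 0.4, 0.7, 0.2])
pairwise_preference_loss(scores, fitness).backward()
```

A ranking loss of this kind optimizes the ordering of variants rather than their absolute fitness values, which suits the stated goal of prioritizing promising variants from sparse labels.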
Related papers
- Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design [61.2846583160056]
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, but it is harder to apply to proteins, in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
arXiv Detail & Related papers (2025-12-10T05:34:47Z)
- S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening [72.89086338778098]
We propose a two-stage framework for protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2-based backbone. In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement.
arXiv Detail & Related papers (2025-11-10T11:57:47Z)
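The S$^2$Drug entry above combines a gated sequence-structure fusion with contrastive learning. A minimal sketch of those two generic ingredients (module names, dimensions, and the symmetric InfoNCE formulation are assumptions for illustration, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Fuse per-residue sequence and structure features with a learned gate
    (illustrative stand-in for a residue-level gating module)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, seq_feats, struct_feats):  # both: (L, dim)
        g = torch.sigmoid(self.gate(torch.cat([seq_feats, struct_feats], dim=-1)))
        return g * seq_feats + (1 - g) * struct_feats

def info_nce(protein_emb, ligand_emb, tau: float = 0.07):
    """Symmetric contrastive loss over a batch of matched protein-ligand pairs."""
    p = F.normalize(protein_emb, dim=-1)
    l = F.normalize(ligand_emb, dim=-1)
    logits = p @ l.t() / tau                      # (B, B) similarity matrix
    target = torch.arange(len(p), device=p.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))
```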
- Meta-Learning for Cross-Task Generalization in Protein Mutation Property Prediction [9.083239192939661]
We introduce the first application of Model-Agnostic Meta-Learning (MAML) to protein mutation property prediction. We also introduce a novel mutation encoding strategy using separator tokens to directly incorporate mutations into sequence context. Our mutation encoding addresses a critical limitation: standard transformers treat mutation positions as unknown tokens, which significantly degrades performance.
arXiv Detail & Related papers (2025-10-23T19:09:06Z)
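The separator-token encoding in the entry above is described only at a high level. One plausible reading, appending mutation descriptors to the sequence between separator tokens (the exact token layout is an assumption):

```python
def encode_mutations(wild_type: str, mutations: list[str], sep: str = "[SEP]") -> str:
    """Append mutation descriptors (e.g. 'A123V') to the wild-type sequence,
    delimited by separator tokens, so the transformer sees the mutations as
    explicit context. Hypothetical layout, not the paper's exact scheme."""
    parts = [wild_type]
    for mut in mutations:
        parts += [sep, mut]
    return " ".join(parts)

# -> "MKTAYIAK [SEP] A123V [SEP] K45E"
print(encode_mutations("MKTAYIAK", ["A123V", "K45E"]))
```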
- One protein is all you need [17.551862138613405]
Generalization beyond training data remains a central challenge in machine learning for biology. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, model sizes, and datasets.
arXiv Detail & Related papers (2024-11-04T14:23:59Z)
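Test-time training, central to both ProteinTTT above and PRIMO itself, typically means a few self-supervised gradient steps on the test protein before predicting. A generic masked-LM sketch (the model and tokenizer interfaces are placeholder assumptions; the papers' actual recipes may differ):

```python
import torch

def test_time_train(model, tokenizer, sequence: str, steps: int = 30,
                    lr: float = 1e-4, mask_frac: float = 0.15):
    """Adapt a masked protein language model to one test protein by a few
    steps of masked-token reconstruction on that sequence alone."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ids = tokenizer(sequence)                # assumed: str -> LongTensor of shape (1, L)
    for _ in range(steps):
        masked = ids.clone()
        mask = torch.rand(ids.shape) < mask_frac
        masked[mask] = tokenizer.mask_id     # assumed mask-token attribute
        logits = model(masked)               # assumed output: (1, L, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits[mask], ids[mask])         # reconstruct only the masked positions
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```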
- Metalic: Meta-Learning In-Context with Protein Language Models [5.868595531658237]
Machine learning has emerged as a promising technique for protein fitness prediction tasks. Due to data scarcity, we believe meta-learning will play a pivotal role in advancing protein engineering.
arXiv Detail & Related papers (2024-10-10T20:19:35Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
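Masked microenvironment modeling in the entry above is named but not specified. One pattern the name suggests is quantizing local structural microenvironments against a learned codebook and training the network to recover the codes of masked residues; a hedged sketch under that assumption (all shapes and module choices are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMicroenvModel(nn.Module):
    """Assign each residue's microenvironment features to the nearest codebook
    entry, then predict the codes of masked residues from context."""
    def __init__(self, dim: int = 128, codes: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, codes)

    def forward(self, feats, mask):  # feats: (B, L, dim), mask: (B, L) bool
        # Nearest-codebook assignment gives each residue a discrete target code.
        dists = torch.cdist(feats, self.codebook.weight)  # (B, L, codes)
        targets = dists.argmin(-1)                        # (B, L)
        x = feats.masked_fill(mask.unsqueeze(-1), 0.0)    # hide masked microenvironments
        logits = self.head(self.encoder(x))               # (B, L, codes)
        return F.cross_entropy(logits[mask], targets[mask])
```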
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted, efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
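One simple way to integrate sequence and structure features, as the thermostability entry above describes, is to concatenate the mutant-versus-wild-type embedding difference from a protein language model with structural descriptors and regress the stability change (a hedged sketch; the feature set and head are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DDGRegressor(nn.Module):
    """Predict a stability change (ddG) from the difference between mutant and
    wild-type sequence embeddings plus structural descriptors (illustrative)."""
    def __init__(self, seq_dim: int = 1280, struct_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_dim + struct_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb_wt, emb_mut, struct_feats):
        # emb_*: (B, seq_dim) pooled embeddings, e.g. from an ESM-family model;
        # struct_feats: (B, struct_dim), e.g. solvent accessibility, contact counts.
        return self.mlp(torch.cat([emb_mut - emb_wt, struct_feats], dim=-1)).squeeze(-1)
```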
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences. Our model attains better accuracy and improves data efficiency by up to $10^5$ times over baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
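Reprogramming in R2DL's sense, as summarized above, maps protein tokens into a frozen English-language embedding space. A minimal sketch of that idea using a learned mixing matrix over the source vocabulary (the sparsity constraints and training details of the actual dictionary-learning procedure are omitted):

```python
import torch
import torch.nn as nn

class ReprogrammedEmbedding(nn.Module):
    """Represent each protein token as a learned linear combination of a
    frozen source-vocabulary embedding table (simplified R2DL-style idea)."""
    def __init__(self, source_embeddings: torch.Tensor, protein_vocab: int = 25):
        super().__init__()
        # Frozen English token embeddings, shape (V_src, dim)
        self.source = nn.Parameter(source_embeddings, requires_grad=False)
        # Learned mixing weights: one row per protein token
        self.mix = nn.Parameter(torch.randn(protein_vocab, source_embeddings.shape[0]) * 0.01)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Softmax keeps each protein token a convex mixture of source embeddings.
        return torch.softmax(self.mix, dim=-1)[token_ids] @ self.source
```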
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)