Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability
- URL: http://arxiv.org/abs/2602.14828v1
- Date: Mon, 16 Feb 2026 15:21:11 GMT
- Authors: Ana F. Rodrigues, Lucas Ferraz, Laura Balbi, Pedro Giesteira Cotovio, Catia Pesquita
- Abstract summary: Protein bioengineering poses unique challenges for sequence representation. Experiments typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract meaningful signals.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.
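The abstract's core observation is that a single localized mutation barely moves a sequence-level (pooled) representation, even though it sharply changes the embedding at the mutated position. The toy sketch below illustrates this dilution effect with synthetic per-residue vectors; the shapes and values are purely illustrative and do not come from ProtBERT or ESM2:

```python
import numpy as np

# Hypothetical per-residue embeddings for a 10-residue peptide:
# shape (L, d) -- one d-dimensional vector per amino acid position.
# In practice these would come from a pre-trained model such as ESM2.
rng = np.random.default_rng(0)
L, d = 10, 8
per_residue = rng.normal(size=(L, d))        # amino acid-level representation

# Sequence-level representation: mean-pool over positions.
sequence_level = per_residue.mean(axis=0)    # shape (d,)

# A single point mutation changes only one row of the matrix, so its
# effect on the pooled vector is diluted by a factor of 1/L.
mutated = per_residue.copy()
mutated[3] += 1.0                            # perturb one position
shift_residue = np.abs(mutated[3] - per_residue[3]).mean()
shift_pooled = np.abs(mutated.mean(axis=0) - sequence_level).mean()
print(shift_residue, shift_pooled)           # pooled shift is 1/L of the local one
```

This is one way to see why, for datasets with sparse or highly localized mutations, amino acid-level embeddings can carry a stronger supervised signal than pooled representations unless the model is fine-tuned on task-specific labels.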
Related papers
- S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening
We propose a two-stage framework for protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2-based backbone. In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module. An auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement.
arXiv Detail & Related papers (2025-11-10T11:57:47Z)
- ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis
This paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints in existing techniques. Our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout. Our experiments achieved 90.8% accuracy in binary classification, comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-08-10T13:45:08Z)
- SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions
This study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model.
A multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information.
arXiv Detail & Related papers (2024-11-18T12:40:39Z)
- MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome.
Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs.
We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
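As a purely illustrative toy of the evolutionary idea summarized above, the sketch below evolves single-nucleotide perturbations of a seed sequence and retains the most divergent variants. The Hamming-distance objective here is a simple stand-in for the paper's semantic-diversity criterion, which this sketch does not reproduce:

```python
import random

random.seed(0)
BASES = "ACGT"

def mutate(seq: str, n_mut: int = 1) -> str:
    """Apply n_mut random single-nucleotide substitutions (illustrative only)."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice([b for b in BASES if b != s[pos]])
    return "".join(s)

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# Evolve a small population of perturbed sequences, keeping the variants
# farthest from the seed (a stand-in diversity objective).
seed_seq = "ACGTACGTACGTACGT"
population = [mutate(seed_seq, 2) for _ in range(20)]
for _ in range(10):
    children = [mutate(random.choice(population), 1) for _ in range(20)]
    population = sorted(population + children,
                        key=lambda s: hamming(s, seed_seq),
                        reverse=True)[:20]

print(max(hamming(s, seed_seq) for s in population))
```

Swapping the Hamming objective for a model-derived score would turn this loop into the kind of perturbation search the summary describes.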
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering
We develop SESNet, a supervised deep-learning model to predict the fitness of protein mutants.
We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship.
Our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants.
arXiv Detail & Related papers (2022-12-29T01:49:52Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
- Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins
Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering.
We use clustering, embedding, and dimensionality reduction techniques to select combinations of physicochemical properties for the encoding stage.
We then select the best performing predictive models in each set of properties and create an assembled model.
arXiv Detail & Related papers (2020-10-07T16:35:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.