Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design
- URL: http://arxiv.org/abs/2512.09329v1
- Date: Wed, 10 Dec 2025 05:34:47 GMT
- Title: Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design
- Authors: Amin Tavakoli, Raswanth Murugan, Ozan Gokdemir, Arvind Ramanathan, Frances Arnold, Anima Anandkumar
- Abstract summary: Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
- Score: 61.2846583160056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of PLM and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.
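To make the recipe concrete, here is a minimal sketch of the generate-curate-fine-tune loop the abstract describes. All interfaces (sample_fn, filters, finetune_fn) are hypothetical placeholders, not the authors' released API; the actual filters are domain-specific (e.g., tailored to tryptophan synthase).

```python
from typing import Callable

def self_distill_sft(
    sample_fn: Callable[[int], list[str]],       # PLM sampling (e.g. temperature sampling)
    filters: list[Callable[[str], bool]],        # domain filters (length, motifs, structure checks, ...)
    finetune_fn: Callable[[list[str]], None],    # one supervised fine-tuning pass
    n_candidates: int = 10_000,
    n_rounds: int = 3,
) -> list[str]:
    """Generate with the PLM, curate with filters, fine-tune on the survivors."""
    curated: list[str] = []
    for _ in range(n_rounds):
        candidates = sample_fn(n_candidates)
        # Keep only sequences that pass every curation filter.
        kept = [seq for seq in candidates if all(f(seq) for f in filters)]
        curated.extend(kept)
        finetune_fn(kept)  # distill the model's own curated output back into it
    return curated
```

Each round distills the model's own filtered output back into its weights, which is what removes the need for precompiled experimental datasets.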
Related papers
- Protein as a Second Language for LLMs [50.34983283157322]
"Protein-as-Second-Language" framework reformulates amino-acid sequences as sentences in a novel symbolic language.<n>We curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning.<n>Our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement.
arXiv Detail & Related papers (2025-10-13T09:21:45Z)
- ProteinAE: Protein Diffusion Autoencoders for Structure Encoding [64.77182442408254]
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
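The coordinate-to-latent mapping can be pictured with a toy autoencoder over C-alpha backbone coordinates. ProteinAE itself is a diffusion autoencoder with a more sophisticated architecture, so this plain PyTorch MLP is only a stand-in for the idea.

```python
import torch
import torch.nn as nn

class CoordAutoencoder(nn.Module):
    """Toy MLP autoencoder over C-alpha backbone coordinates."""
    def __init__(self, n_residues: int = 128, latent_dim: int = 32):
        super().__init__()
        in_dim = n_residues * 3  # (x, y, z) per residue
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        z = self.decoder(self.encoder(coords.flatten(1)))  # compact latent in between
        return z.view_as(coords)                           # reconstructed coordinates

model = CoordAutoencoder()
coords = torch.randn(4, 128, 3)                       # fake backbone coordinates
loss = nn.functional.mse_loss(model(coords), coords)  # reconstruction objective
```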
arXiv Detail & Related papers (2025-10-12T14:30:32Z)
- Steering Protein Language Models [22.308373820985793]
Activation Steering is a technique originally developed for controlling text generation in Large Language Models. We propose a simple yet effective method that employs activation editing to steer PLM outputs. We show that our method can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training.
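The core trick, adding a fixed steering vector to a layer's activations at inference time, can be sketched with a PyTorch forward hook. The layer and vector below are illustrative stand-ins; real PLM module paths differ by model.

```python
import torch
import torch.nn as nn

hidden_dim = 64
layer = nn.Linear(hidden_dim, hidden_dim)        # stand-in for one PLM layer
steering_vector = 0.1 * torch.randn(hidden_dim)  # e.g. a learned "property" direction

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + steering_vector

handle = layer.register_forward_hook(steer)
activations = layer(torch.randn(2, hidden_dim))  # activations shifted in-flight
handle.remove()                                  # detach to restore normal behavior
```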
arXiv Detail & Related papers (2025-07-01T16:03:55Z)
- Controllable Protein Sequence Generation with LLM Preference Optimization [19.28325662879149]
We propose a novel controllable protein design method called CtrlProt. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively.
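Preference optimization over sequences can be approximated by the standard DPO loss on pairs of preferred and dispreferred variants; CtrlProt's exact objective may differ, so treat this as a generic sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are log-prob ratios against a frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Example: sequence log-likelihoods (summed over tokens) for a preferred
# variant vs. a dispreferred one, under the policy and the reference model.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
```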
arXiv Detail & Related papers (2025-01-25T00:59:12Z)
- Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers.
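The "secret optimizer" framing amounts to an evolutionary loop in which the LLM proposes mutants and a fitness oracle selects survivors. propose_fn and fitness_fn below are hypothetical placeholders, not the paper's exact setup.

```python
import heapq
from typing import Callable

def llm_directed_evolution(
    wild_type: str,
    propose_fn: Callable[[list[str]], list[str]],  # LLM proposes mutants of the pool
    fitness_fn: Callable[[str], float],            # e.g. a learned fitness predictor
    pool_size: int = 16,
    n_rounds: int = 10,
) -> str:
    pool = [wild_type]
    for _ in range(n_rounds):
        candidates = set(pool) | set(propose_fn(pool))
        # Select the top-scoring variants to seed the next round.
        pool = heapq.nlargest(pool_size, candidates, key=fitness_fn)
    return max(pool, key=fitness_fn)
```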
arXiv Detail & Related papers (2025-01-16T03:44:16Z)
- Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences, both in learning meaningful representations and in generative drug design. Most protein LMs are based on the Transformer architecture and trained on individual proteins with short context lengths. In this work, we propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models.
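The bidirectional-with-shared-projections idea can be sketched with a toy block that runs a sequence module in both directions through one shared projection pair. A GRU stands in here for the selective state-space (Mamba) layer that the real model uses.

```python
import torch
import torch.nn as nn

class BiSSMBlock(nn.Module):
    """Bidirectional sequence block with projections shared across directions."""
    def __init__(self, d_model: int = 64, d_inner: int = 128):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_inner)   # shared by both passes
        self.out_proj = nn.Linear(d_inner, d_model)  # shared by both passes
        self.seq = nn.GRU(d_inner, d_inner, batch_first=True)  # stand-in for a Mamba layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.in_proj(x)
        fwd, _ = self.seq(h)           # left-to-right pass
        bwd, _ = self.seq(h.flip(1))   # right-to-left pass on the reversed sequence
        return self.out_proj(fwd + bwd.flip(1))

out = BiSSMBlock()(torch.randn(2, 1024, 64))  # long protein context
```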
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as a reward function to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
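A minimal version of this setup uses the PLM's log-likelihood as the reward inside a plain REINFORCE update. The policy interface and plm_logprob below are hypothetical stand-ins, not the paper's exact formulation.

```python
import torch

def reinforce_step(policy, optimizer, plm_logprob, batch_size=8, seq_len=50):
    # policy.sample is a hypothetical API returning sequences and their log-probs.
    sequences, logps = policy.sample(batch_size, seq_len)
    with torch.no_grad():
        # Reward each sequence by its log-likelihood under the frozen PLM.
        rewards = torch.tensor([plm_logprob(seq) for seq in sequences])
    advantage = rewards - rewards.mean()  # batch baseline reduces variance
    loss = -(advantage * logps).mean()    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```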
arXiv Detail & Related papers (2024-07-03T14:31:36Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction.
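Casting PPI prediction as a chain-of-thought prompt might look like the hypothetical template below; the field layout is illustrative, not ProLLM's actual format.

```python
def ppi_prompt(protein_a: str, protein_b: str, pathway_context: list[str]) -> str:
    steps = "\n".join(f"- {step}" for step in pathway_context)
    return (
        f"Protein A: {protein_a}\n"
        f"Protein B: {protein_b}\n"
        f"Known signaling steps:\n{steps}\n"
        "Reason step by step about intermediate interactions, "
        "then answer yes or no: do A and B interact?"
    )
```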
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
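A common sequence-side ingredient for such predictors is the masked-LM log-odds score of a mutation. The sketch below assumes a placeholder masked_logprobs function standing in for an ESM-style forward pass, and it is only a heuristic proxy for the paper's full sequence-plus-structure model.

```python
def mutation_log_odds(masked_logprobs, sequence: str, pos: int, mut_aa: str) -> float:
    # masked_logprobs(sequence, pos) is assumed to mask position `pos` and
    # return a dict mapping each amino acid to its log-probability there.
    logp = masked_logprobs(sequence, pos)
    # Positive scores mean the model prefers the mutant over the wild type,
    # a rough sequence-only proxy for the effect of the mutation.
    return logp[mut_aa] - logp[sequence[pos]]
```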
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models [0.0]
In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while maintaining other residues.
Protein language models (pLMs) have been a promising tool for protein sequence design.
We show that language models trained via fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering.
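The fill-in-middle transformation itself is easy to sketch: the span to redesign is moved to the end so a left-to-right model can generate it conditioned on both flanks. The sentinel tokens below are illustrative placeholders, not ProtFIM's actual vocabulary.

```python
def to_fim(sequence: str, start: int, end: int) -> tuple[str, str]:
    # Split into prefix / middle / suffix and move the middle to the end.
    prefix, middle, suffix = sequence[:start], sequence[start:end], sequence[end:]
    prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"  # illustrative sentinel tokens
    return prompt, middle                        # model input, generation target

prompt, target = to_fim("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10, 18)
```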
arXiv Detail & Related papers (2023-03-29T04:35:50Z)