Controllable Protein Sequence Generation with LLM Preference Optimization
- URL: http://arxiv.org/abs/2501.15007v1
- Date: Sat, 25 Jan 2025 00:59:12 GMT
- Title: Controllable Protein Sequence Generation with LLM Preference Optimization
- Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu
- Abstract summary: We propose a novel controllable protein design method called CtrlProt. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively.
- Score: 19.28325662879149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pre-trained protein large language models (LLMs) have shown promising results on protein sequence generation. However, when generation is steered toward specific attributes, sequences produced by existing methods still exhibit poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi-listwise preference optimization strategy to improve generation quality and support multi-attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation.
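As a rough illustration of the preference-optimization idea, the sketch below implements a generic listwise, DPO-style loss over ranked candidate sequences; the loss shape, beta, and ranking source are assumptions, not CtrlProt's published multi-listwise objective.

    # Illustrative sketch of a listwise preference loss (hypothetical; CtrlProt's
    # actual "multi-listwise" objective may differ in detail).
    import torch

    def listwise_preference_loss(policy_logps, ref_logps, beta=0.1):
        """policy_logps, ref_logps: (K,) sequence log-probs of K candidates,
        ordered from most- to least-preferred (e.g., by a quality metric)."""
        rewards = beta * (policy_logps - ref_logps)  # implicit DPO-style rewards
        # Plackett-Luce likelihood of the given ranking: each item beats the tail.
        loss = 0.0
        for i in range(len(rewards) - 1):
            loss = loss - (rewards[i] - torch.logsumexp(rewards[i:], dim=0))
        return loss

    # Toy usage: 4 candidates ranked by (functionality, stability) scores.
    policy = torch.tensor([-10.0, -12.0, -11.5, -14.0], requires_grad=True)
    ref = torch.tensor([-11.0, -11.0, -11.0, -11.0])
    print(listwise_preference_loss(policy, ref))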
Related papers
- Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design [61.2846583160056]
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet it is harder to apply to proteins, in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
arXiv Detail & Related papers (2025-12-10T05:34:47Z) - ProteinAE: Protein Diffusion Autoencoders for Structure Encoding [64.77182442408254]
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
arXiv Detail & Related papers (2025-10-12T14:30:32Z) - AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model [92.51919604882984]
We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks. AMix-1 is empowered by a systematic training methodology encompassing pretraining scaling laws, emergent-capability analysis, an in-context learning mechanism, and a test-time scaling algorithm. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework.
arXiv Detail & Related papers (2025-07-11T17:02:25Z) - Steering Protein Language Models [22.308373820985793]
Activation steering is a technique originally developed for controlling text generation in large language models. We propose a simple yet effective method that employs activation editing to steer PLM outputs. We show that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training.
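A minimal sketch of activation steering via a forward hook, assuming the common add-a-vector-to-hidden-states formulation; the model, layer index, and attribute direction in the usage note are hypothetical placeholders, and the paper's exact editing scheme may differ.

    # Minimal sketch of activation steering with a PyTorch forward hook.
    import torch

    def add_steering_hook(layer, steering_vector, alpha=1.0):
        """Adds alpha * steering_vector to the layer's output activations."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * steering_vector.to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return layer.register_forward_hook(hook)

    # Usage sketch (hypothetical PLM and attribute direction, e.g., estimated
    # from the difference of mean activations on two attribute classes):
    # handle = add_steering_hook(plm.encoder.layers[12], attr_direction, alpha=4.0)
    # ... generate sequences ...
    # handle.remove()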
arXiv Detail & Related papers (2025-07-01T16:03:55Z) - ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning [49.2607661375311]
We present ProteinZero, a novel framework that enables computationally scalable, automated, and continuous self-improvement of the inverse folding model. ProteinZero substantially outperforms existing methods across every key metric in protein design. Notably, the entire RL run on CATH-4.3 can be completed on a single 8-GPU node in under 3 days, including reward computation.
arXiv Detail & Related papers (2025-06-09T06:08:59Z) - Large Language Model is Secretly a Protein Sequence Optimizer [24.55348363931866]
We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. We demonstrate that large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers.
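The propose-and-select loop this framing suggests can be sketched as follows, with a random mutator standing in for the LLM proposer and a toy hydrophobicity count standing in for a real fitness oracle; both stand-ins are illustrative only.

    # Sketch of a directed-evolution loop; the LLM proposer and fitness oracle
    # are replaced by toy placeholders.
    import random

    AAS = "ACDEFGHIKLMNPQRSTVWY"

    def mutate(seq, n_mut=1):                # stand-in for LLM-proposed edits
        s = list(seq)
        for _ in range(n_mut):
            i = random.randrange(len(s))
            s[i] = random.choice(AAS)
        return "".join(s)

    def fitness(seq):                        # toy oracle: hydrophobic residue count
        return sum(c in "AILMFWV" for c in seq)

    def optimize(wild_type, rounds=10, pool=16, keep=4):
        survivors = [wild_type]
        for _ in range(rounds):
            candidates = [mutate(random.choice(survivors)) for _ in range(pool)]
            candidates += survivors          # elitism: retain current best
            survivors = sorted(candidates, key=fitness, reverse=True)[:keep]
        return survivors[0]

    print(optimize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))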
arXiv Detail & Related papers (2025-01-16T03:44:16Z) - Multi-Modal CLIP-Informed Protein Editing [8.927362207499181]
We propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning.
Our approach comprises two stages: in the pretraining stage, contrastive learning aligns protein-biotext representations encoded by two large language models (LLMs).
During the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences.
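The pretraining stage's contrastive alignment can be sketched as a standard symmetric InfoNCE (CLIP-style) loss; the embedding dimension, batch size, and temperature below are illustrative, and the random tensors stand in for the two LLM encoders.

    # Sketch of CLIP-style contrastive alignment of protein and biotext embeddings.
    import torch
    import torch.nn.functional as F

    def clip_alignment_loss(protein_emb, text_emb, temperature=0.07):
        """protein_emb, text_emb: (B, D) embeddings of paired proteins and biotexts."""
        p = F.normalize(protein_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = p @ t.T / temperature           # (B, B) similarity matrix
        targets = torch.arange(len(p))           # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Toy usage with random features standing in for the two encoders.
    print(clip_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)))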
arXiv Detail & Related papers (2024-07-27T16:41:08Z) - Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as reward functions to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
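One plausible instantiation of a PLM-as-reward setup is a REINFORCE-style update where a frozen PLM's sequence log-likelihood serves as the reward; this is a sketch of the general idea, not the specific RL algorithms the paper benchmarks.

    # Sketch of a REINFORCE-style policy-gradient step with a PLM reward.
    import torch

    def reinforce_step(policy_logps, reward_logps, baseline=0.0):
        """policy_logps: (B,) log-probs of sampled sequences under the policy.
        reward_logps: (B,) log-likelihoods of the same sequences under a frozen
        PLM, used as the reward signal."""
        advantage = (reward_logps - baseline).detach()
        return -(advantage * policy_logps).mean()  # policy-gradient surrogate loss

    # Toy usage with random numbers standing in for real model outputs.
    loss = reinforce_step(torch.randn(4, requires_grad=True), torch.randn(4))
    loss.backward()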
arXiv Detail & Related papers (2024-07-03T14:31:36Z) - Diffusion on language model embeddings for protein sequence generation [0.5442686600296733]
We introduce DiMA, a model that leverages continuous diffusion to generate amino acid sequences.
We quantitatively illustrate the impact of the design choices that lead to its superior performance.
Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space.
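A minimal sketch of one training step for continuous diffusion over embeddings, using a standard DDPM-style epsilon-prediction loss; the tiny MLP denoiser, linear beta schedule, and 64-dimensional embeddings are placeholders rather than DiMA's actual design.

    # Sketch of a denoising step for continuous diffusion over LM embeddings.
    import torch
    import torch.nn as nn

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    denoiser = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

    def diffusion_step(emb):                     # emb: (B, 64) residue embeddings
        t = torch.randint(0, T, (emb.size(0),))
        a = alphas_bar[t].unsqueeze(-1)
        noise = torch.randn_like(emb)
        noisy = a.sqrt() * emb + (1 - a).sqrt() * noise
        t_feat = (t.float() / T).unsqueeze(-1)   # crude timestep conditioning
        pred = denoiser(torch.cat([noisy, t_feat], dim=-1))
        return ((pred - noise) ** 2).mean()      # epsilon-prediction loss

    print(diffusion_step(torch.randn(8, 64)))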
arXiv Detail & Related papers (2024-03-06T14:15:20Z) - Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
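A sketch of the absorbing-state (masking) discrete diffusion step that this family of models builds on: corrupt a random fraction of tokens to [MASK] and train the network to recover them. The trivial model and the omitted loss reweighting are simplifications.

    # Sketch of an absorbing-state discrete diffusion training step.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, MASK = 21, 20                 # 20 amino acids + [MASK]
    model = nn.Sequential(nn.Embedding(VOCAB, 128), nn.Linear(128, VOCAB))

    def discrete_diffusion_step(tokens):         # tokens: (B, L) residue indices
        t = torch.rand(tokens.size(0), 1)        # per-sequence noise level
        corrupt = torch.rand_like(tokens, dtype=torch.float) < t
        noisy = torch.where(corrupt, torch.full_like(tokens, MASK), tokens)
        logits = model(noisy)
        # Cross-entropy only on corrupted positions (timestep reweighting omitted).
        return F.cross_entropy(logits[corrupt], tokens[corrupt])

    print(discrete_diffusion_step(torch.randint(0, 20, (4, 32))))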
arXiv Detail & Related papers (2024-02-28T18:57:56Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.<n>xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.<n>It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
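The overall recipe described here, pairing sequence embeddings with structural features to regress a stability change, might look like the sketch below; all dimensions, the ESM-sized 1280-d embeddings, and the MLP head are illustrative assumptions.

    # Sketch of a stability-change regressor over sequence + structure features.
    import torch
    import torch.nn as nn

    class StabilityHead(nn.Module):
        def __init__(self, seq_dim=1280, struct_dim=32):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(seq_dim * 2 + struct_dim, 256), nn.ReLU(),
                nn.Linear(256, 1))

        def forward(self, wt_emb, mut_emb, struct_feats):
            # Wild-type/mutant embeddings at the mutated site plus local
            # structural features, concatenated into one regression input.
            return self.mlp(torch.cat([wt_emb, mut_emb, struct_feats], dim=-1))

    head = StabilityHead()
    ddg = head(torch.randn(2, 1280), torch.randn(2, 1280), torch.randn(2, 32))
    print(ddg.shape)                     # (2, 1) predicted stability change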
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, where a lightweight structural adapter is implanted into the pLM and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
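A lightweight structural adapter of the kind described can be sketched as a small cross-attention block that lets pLM hidden states attend to projected structure features; the dimensions and wiring below are assumptions, not LM-Design's exact architecture.

    # Sketch of a structural adapter: cross-attention from pLM states to
    # structure features, with a residual update.
    import torch
    import torch.nn as nn

    class StructuralAdapter(nn.Module):
        def __init__(self, d_model=640, d_struct=128, n_heads=8):
            super().__init__()
            self.proj = nn.Linear(d_struct, d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, plm_states, struct_feats):
            s = self.proj(struct_feats)              # (B, L, d_model)
            out, _ = self.attn(plm_states, s, s)     # queries from the pLM
            return self.norm(plm_states + out)       # residual update

    adapter = StructuralAdapter()
    h = adapter(torch.randn(2, 50, 640), torch.randn(2, 50, 128))
    print(h.shape)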
arXiv Detail & Related papers (2023-02-03T10:49:52Z) - Plug & Play Directed Evolution of Proteins with Gradient-based Discrete
MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
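A gradient-informed substitution proposal in the spirit of Gibbs-with-gradients can be sketched as below, using a first-order estimate of the score change for every single-site mutation; the scoring network is a toy placeholder, and the paper's samplers are more elaborate.

    # Sketch of a gradient-based discrete proposal over single substitutions.
    import torch
    import torch.nn as nn

    L_SEQ, VOCAB = 32, 20
    score_net = nn.Sequential(nn.Flatten(), nn.Linear(L_SEQ * VOCAB, 1))

    def propose(onehot, temp=2.0):
        """onehot: (L, VOCAB) one-hot sequence. Returns a proposed mutation."""
        x = onehot.clone().requires_grad_(True)
        score_net(x.unsqueeze(0)).sum().backward()
        # First-order estimate of the score change for every substitution.
        delta = x.grad - (x.grad * onehot).sum(-1, keepdim=True)
        delta = delta.masked_fill(onehot.bool(), float("-inf"))  # no self-moves
        probs = torch.softmax(delta.flatten() / temp, dim=0)
        idx = torch.multinomial(probs, 1).item()
        return divmod(idx, VOCAB)        # (position, new amino acid index)

    seq = torch.eye(VOCAB)[torch.randint(0, VOCAB, (L_SEQ,))]
    print(propose(seq))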
arXiv Detail & Related papers (2022-12-20T00:26:23Z) - Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
The model's tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution [18.726398852721204]
We propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution.
ODBO combines a novel low-dimensional protein encoding strategy with Bayesian optimization enhanced by search-space prescreening via outlier detection.
We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest.
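One round of the prescreen-then-optimize loop might look like the following sketch, using scikit-learn's IsolationForest for outlier-based prescreening and a Gaussian-process upper-confidence-bound acquisition; the encodings and measurements are synthetic stand-ins for real variant data.

    # Sketch of one Bayesian-optimization round with outlier prescreening.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    X_obs = rng.normal(size=(20, 8))                 # encoded, measured variants
    y_obs = X_obs[:, 0] + 0.1 * rng.normal(size=20)  # toy fitness measurements
    pool = rng.normal(size=(500, 8))                 # encoded candidate variants

    # Prescreen: keep candidates resembling the observed (promising) region.
    keep = IsolationForest(random_state=0).fit(X_obs).predict(pool) == 1
    survivors = pool[keep]

    # Acquisition: upper confidence bound on the surviving candidates.
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)
    mu, sigma = gp.predict(survivors, return_std=True)
    best = survivors[np.argmax(mu + 2.0 * sigma)]
    print(best[:3])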
arXiv Detail & Related papers (2022-05-19T13:21:31Z) - EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Compared with traditional Rosetta-based structure optimization routines, our EBM-Fold approach efficiently produces high-quality decoys.
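Fully-differentiable structure refinement can be sketched as Langevin-style gradient descent on backbone coordinates under a learned energy; the untrained toy network below stands in for a trained generative model.

    # Sketch of energy-guided structure refinement via Langevin-style descent.
    import torch
    import torch.nn as nn

    energy_net = nn.Sequential(nn.Flatten(), nn.Linear(50 * 3, 64),
                               nn.SiLU(), nn.Linear(64, 1))

    def refine(coords, steps=100, lr=1e-2, noise=1e-3):
        """coords: (L, 3) C-alpha coordinates; descend the learned energy."""
        x = coords.clone().requires_grad_(True)
        for _ in range(steps):
            e = energy_net(x.unsqueeze(0)).sum()
            (g,) = torch.autograd.grad(e, x)
            with torch.no_grad():
                x -= lr * g + noise * torch.randn_like(x)
        return x.detach()

    decoy = refine(torch.randn(50, 3))
    print(decoy.shape)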
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.