ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings
- URL: http://arxiv.org/abs/2504.10983v1
- Date: Tue, 15 Apr 2025 08:46:53 GMT
- Title: ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings
- Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
- Abstract summary: ProtFlow is a fast flow matching-based protein sequence design framework. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies.
- Score: 8.068149785650649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, a large modeling space, and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from the semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for multichain protein design. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
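The abstract's two central ideas, flow matching in a latent space and single-step generation along straightened paths, can be illustrated with a small toy sketch. This is not the ProtFlow implementation: the 8-dimensional latent, the fixed target `MU`, and the closed-form `toy_velocity` (standing in for a trained neural network) are all assumptions made for illustration. The sketch uses the straight-line interpolation common in rectified-flow style methods, where the regression target for the velocity is simply `x1 - x0`.

```python
import numpy as np

# Hypothetical target latent (stands in for a compressed pLM embedding).
MU = np.linspace(-1.0, 1.0, 8)


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    t = np.asarray(t).reshape(-1, 1)
    return (1.0 - t) * x0 + t * x1


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted velocity and the target x1 - x0."""
    xt = interpolate(x0, x1, t)
    target = x1 - x0
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


def toy_velocity(x, t, mu=MU):
    """Closed-form velocity field that moves every point straight toward mu.

    Assumption: a real model would be a trained network; this analytic field
    is exact only for the toy case where all data equal mu.
    """
    t = np.asarray(t).reshape(-1, 1)
    return (mu - x) / (1.0 - t)


def euler_sample(velocity_fn, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full(x.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8))
data = np.broadcast_to(MU, noise.shape)
times = rng.uniform(0.0, 0.9, size=4)

print("loss:", flow_matching_loss(toy_velocity, noise, data, times))
print("one-step error:", np.abs(euler_sample(toy_velocity, noise, 1) - MU).max())
```

Because the paths in this toy case are perfectly straight, the velocity is constant along each trajectory, so a single Euler step already lands on the target. Reflow procedures aim to straighten learned paths so that few-step or one-step sampling becomes accurate in a similar way.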
Related papers
- ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation [24.13216117355207]
We propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. Our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain. Experiments show that ReQFlow achieves state-of-the-art performance in protein backbone generation.
arXiv Detail & Related papers (2025-02-20T15:20:37Z)
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- Multi-Scale Representation Learning for Protein Fitness Prediction [31.735234482320283]
Previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets.
We introduce the Sequence-Structure-Surface Fitness (S3F) model - a novel multimodal representation learning framework that integrates protein features across several scales.
Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology.
arXiv Detail & Related papers (2024-12-02T04:28:10Z)
- Improving AlphaFlow for Efficient Protein Ensembles Generation [64.10918970280603]
We propose a feature-conditioned generative model called AlphaFlow-Lit to realize efficient protein ensembles generation.
AlphaFlow-Lit performs on par with AlphaFlow and surpasses its distilled version without pretraining, all while achieving a significant sampling acceleration of around 47 times.
arXiv Detail & Related papers (2024-07-08T13:36:43Z)
- Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation [55.93511121486321]
We introduce FoldFlow-2, a novel sequence-conditioned flow matching model for protein structure generation. We train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models.
arXiv Detail & Related papers (2024-05-30T17:53:50Z)
- PPFlow: Target-aware Peptide Design with Torsional Flow Matching [52.567714059931646]
We propose a target-aware peptide design method called PPFlow to model the internal geometries of torsion angles for peptide structure design. In addition, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design.
arXiv Detail & Related papers (2024-03-05T13:26:42Z)
- Progressive Multi-Modality Learning for Inverse Protein Folding [47.095862120116976]
We propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning.
MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge.
Experimental results, obtained by training on only a small dataset, demonstrate that MMDesign consistently outperforms baselines on various public benchmarks.
arXiv Detail & Related papers (2023-12-11T10:59:23Z)
- Protein Sequence Design with Batch Bayesian Optimisation [0.0]
Protein sequence design is a challenging problem in protein engineering, which aims to discover novel proteins with useful biological functions.
Directed evolution is a widely used approach for protein sequence design, which mimics the evolution cycle in a laboratory environment and conducts an iterative protocol.
We propose a new method based on Batch Bayesian Optimization (Batch BO), a well-established optimization method, for protein sequence design.
arXiv Detail & Related papers (2023-03-18T14:53:20Z)
- Protein Sequence and Structure Co-Design with Equivariant Translation [19.816174223173494]
Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models.
We propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state.
Our model consists of a trigonometry-aware encoder that reasons about geometrical constraints and interactions from context features.
All protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process.
arXiv Detail & Related papers (2022-10-17T06:00:12Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- PDBench: Evaluating Computational Methods for Protein Sequence Design [2.0187324832551385]
We present a benchmark set of proteins and propose tests to assess the performance of deep learning based methods.
Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility.
arXiv Detail & Related papers (2021-09-16T12:20:03Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.