Protein Design with Dynamic Protein Vocabulary
- URL: http://arxiv.org/abs/2505.18966v1
- Date: Sun, 25 May 2025 03:50:50 GMT
- Title: Protein Design with Dynamic Protein Vocabulary
- Authors: Nuowei Liu, Jiahao Kuang, Yanting Liu, Changzhi Sun, Tao Ji, Yuanbin Wu, Man Lan
- Abstract summary: We introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins.
- Score: 22.358650729894443
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
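The abstract describes a three-component design: a text encoder for the functional description, a fragment encoder whose output is used to dynamically retrieve natural-protein fragments, and a protein language model that generates the sequence conditioned on both. The sketch below illustrates one plausible way to wire such a retrieval-conditioned generator; all module names, dimensions, and the retrieval scheme are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of a ProDVa-style pipeline as described in the abstract.
# Module names, dimensions, and the retrieval scheme are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FragmentRetriever(nn.Module):
    """Scores a bank of natural-protein fragment embeddings against a text query."""

    def __init__(self, text_dim: int, frag_dim: int, fragment_bank: torch.Tensor):
        super().__init__()
        self.project = nn.Linear(text_dim, frag_dim)
        # fragment_bank: (num_fragments, frag_dim), assumed precomputed by a fragment encoder
        self.register_buffer("fragment_bank", fragment_bank)

    def forward(self, text_emb: torch.Tensor, top_k: int = 4) -> torch.Tensor:
        query = F.normalize(self.project(text_emb), dim=-1)           # (B, frag_dim)
        keys = F.normalize(self.fragment_bank, dim=-1)                # (N, frag_dim)
        scores = query @ keys.T                                       # (B, N)
        idx = scores.topk(top_k, dim=-1).indices                      # (B, top_k)
        return self.fragment_bank[idx]                                # (B, top_k, frag_dim)


class ProteinDesigner(nn.Module):
    """Conditions a small protein language model on text and retrieved fragments."""

    def __init__(self, vocab_size: int = 25, d_model: int = 256, text_dim: int = 256):
        super().__init__()
        self.retriever = FragmentRetriever(text_dim, d_model,
                                           fragment_bank=torch.randn(1000, d_model))
        self.token_emb = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_emb, prev_tokens):
        # Conditioning memory = text embedding plus dynamically retrieved fragments.
        fragments = self.retriever(text_emb)                           # (B, K, d_model)
        memory = torch.cat([text_emb.unsqueeze(1), fragments], dim=1)  # (B, K+1, d_model)
        h = self.decoder(self.token_emb(prev_tokens), memory)
        return self.lm_head(h)                                         # next-token logits


# Usage: a batch of 2 text embeddings and partially designed sequences of length 10.
model = ProteinDesigner()
logits = model(torch.randn(2, 256), torch.randint(0, 25, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 25])
```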
Related papers
- ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description [7.198238666986253]
We propose a de novo fine-grained framework capable of designing proteins from any descriptive text input. ProtDAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than separate entities. Experimental results demonstrate that ProtDAT achieves state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity.
arXiv Detail & Related papers (2024-12-05T11:05:46Z) - ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design [61.19456204667385]
We introduce ProteinWeaver, a two-stage framework for protein backbone design. ProteinWeaver generates high-quality, novel protein backbones through versatile domain assembly. By introducing a 'divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.
arXiv Detail & Related papers (2024-11-08T08:10:49Z) - Model-based reinforcement learning for protein backbone design [1.7383284836821535]
We propose the use of AlphaZero to generate protein backbones, meeting shape and structural scoring requirements.
We extend an existing Monte Carlo tree search (MCTS) framework by incorporating a novel threshold-based reward and secondary objectives (sketched below).
AlphaZero consistently surpasses baseline MCTS by more than 100% in top-down protein design tasks.
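A minimal sketch of how a threshold-based reward combined with secondary objectives could be structured is given below. The scoring functions, thresholds, and weights are placeholders, not the paper's actual criteria.

```python
# Hedged sketch of a threshold-based reward with secondary objectives for
# backbone design. All scoring functions and thresholds are placeholder
# assumptions, not the paper's actual criteria.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Objective:
    score: Callable[[object], float]  # maps a candidate backbone to a score
    threshold: float                  # score must reach this value to count
    weight: float = 1.0               # contribution once the primary gate is passed


def backbone_reward(backbone, primary: Objective, secondary: List[Objective]) -> float:
    """Return 0 until the primary shape/structure threshold is met, then add
    weighted credit for each satisfied secondary objective."""
    if primary.score(backbone) < primary.threshold:
        return 0.0
    reward = 1.0
    for obj in secondary:
        if obj.score(backbone) >= obj.threshold:
            reward += obj.weight
    return reward


# Example with dummy scoring functions (stand-ins for real structural metrics).
primary = Objective(score=lambda b: b["shape_match"], threshold=0.8)
secondary = [Objective(score=lambda b: b["compactness"], threshold=0.5, weight=0.5),
             Objective(score=lambda b: b["helix_fraction"], threshold=0.3, weight=0.25)]

candidate = {"shape_match": 0.85, "compactness": 0.6, "helix_fraction": 0.2}
print(backbone_reward(candidate, primary, secondary))  # 1.5
```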
arXiv Detail & Related papers (2024-05-03T10:24:33Z) - Annotation-guided Protein Design with Multi-Level Domain Alignment [39.79713846491306]
We propose Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework. It integrates textual annotations extracted from protein databases for controllable generation in sequence space. Specifically, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations.
arXiv Detail & Related papers (2024-04-18T09:37:54Z) - Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions [58.819567030843025]
This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks.
We propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution.
Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA).
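As a rough illustration of the pool-based idea behind Automated Protein Augmentation (APA), the sketch below draws augmentations from a small pool and applies them to a sequence. The specific augmentations and selection rule are simplified assumptions; the proposed semantic-level methods (Integrated Gradients Substitution, Back Translation Substitution) would be additional entries in the pool.

```python
# Illustrative sketch of an augmentation pool in the spirit of APA. The
# augmentations and selection rule here are simplified assumptions.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_substitution(seq: str, rate: float = 0.05) -> str:
    """Replace a small fraction of residues with random amino acids."""
    return "".join(random.choice(AMINO_ACIDS) if random.random() < rate else aa
                   for aa in seq)


def random_crop(seq: str, min_frac: float = 0.8) -> str:
    """Keep a contiguous window covering at least min_frac of the sequence."""
    length = max(1, int(len(seq) * random.uniform(min_frac, 1.0)))
    start = random.randint(0, len(seq) - length)
    return seq[start:start + length]


# Semantic-level methods such as Integrated Gradients Substitution or
# Back Translation Substitution would be further entries in this pool.
AUGMENTATION_POOL = [random_substitution, random_crop]


def augment(seq: str, n_views: int = 2) -> list:
    """Draw n_views augmentations from the pool and apply each to the sequence."""
    return [random.choice(AUGMENTATION_POOL)(seq) for _ in range(n_views)]


print(augment("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```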
arXiv Detail & Related papers (2024-03-01T07:58:29Z) - A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z) - A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates the protein sequences from the representation. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
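A minimal sketch of the three-step pipeline named in the summary (ProteinCLAP alignment, facilitator, decoder) is shown below. The encoders, losses, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a ProteinDT-style three-step pipeline: contrastive alignment,
# facilitator, decoder. Architectures and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128  # shared representation size (assumed)

text_encoder = nn.Sequential(nn.Linear(300, D))      # stand-in text encoder
protein_encoder = nn.Sequential(nn.Linear(480, D))   # stand-in protein encoder
facilitator = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
decoder = nn.Linear(D, 25)                           # per-position amino-acid logits


def clap_loss(text_feats, protein_feats, temperature=0.07):
    """Step 1: contrastively align text and protein representations (CLIP-style)."""
    t = F.normalize(text_feats, dim=-1)
    p = F.normalize(protein_feats, dim=-1)
    logits = t @ p.T / temperature
    labels = torch.arange(len(t))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


# Step 2: the facilitator maps a text representation into the protein space.
# Step 3: the decoder generates a sequence from that representation
# (a single position here, for brevity).
text_repr = text_encoder(torch.randn(4, 300))
protein_repr = protein_encoder(torch.randn(4, 480))
print("alignment loss:", clap_loss(text_repr, protein_repr).item())
print("decoded logits:", decoder(facilitator(text_repr)).shape)  # (4, 25)
```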
arXiv Detail & Related papers (2023-02-09T12:59:16Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs: a lightweight structural adapter is implanted into the model and endows it with structural awareness (a sketch of such an adapter follows below).
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
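Below is a hedged sketch of the kind of lightweight structural adapter the summary describes: a small module inserted alongside a frozen pLM that injects structure features via cross-attention. Shapes and wiring are assumptions, not the paper's implementation.

```python
# Hedged sketch of a lightweight structural adapter attached to a frozen pLM.
# Shapes and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class StructuralAdapter(nn.Module):
    """Cross-attends from sequence states to structure features, then residual-adds."""

    def __init__(self, d_model: int = 256, d_struct: int = 64, nhead: int = 4):
        super().__init__()
        self.struct_proj = nn.Linear(d_struct, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, seq_states: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        struct = self.struct_proj(struct_feats)                  # (B, L, d_model)
        attended, _ = self.cross_attn(seq_states, struct, struct)
        return self.norm(seq_states + attended)                  # residual update


# The pLM backbone would stay frozen; only the adapter's parameters are trained.
adapter = StructuralAdapter()
hidden = torch.randn(2, 50, 256)        # hidden states from a (frozen) pLM
structure = torch.randn(2, 50, 64)      # per-residue structure features
print(adapter(hidden, structure).shape)  # torch.Size([2, 50, 256])
```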
arXiv Detail & Related papers (2023-02-03T10:49:52Z) - Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - Deep Generative Modeling for Protein Design [0.0]
Deep learning approaches have produced breakthroughs in fields such as image classification and natural language processing.
Generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins.
We discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model-guided protein design.
arXiv Detail & Related papers (2021-08-31T14:38:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.