Exploring the Protein Sequence Space with Global Generative Models
- URL: http://arxiv.org/abs/2305.01941v1
- Date: Wed, 3 May 2023 07:45:29 GMT
- Title: Exploring the Protein Sequence Space with Global Generative Models
- Authors: Sergio Romero-Romero, Sebastian Lindner, Noelia Ferruz
- Abstract summary: Language models have demonstrated exceptional capabilities in processing, translating, and generating human languages.
Language models have been utilized in protein research to embed proteins, generate novel ones, and predict tertiary structures.
In this book chapter, we provide an overview of the use of protein generative models, reviewing 1) language models for the design of novel artificial proteins, 2) works that use non-Transformer architectures, and 3) applications in directed evolution approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in specialized large-scale architectures for
training image and language models have profoundly impacted the fields of
computer vision and natural language processing (NLP). Language models, such as
the recent ChatGPT and GPT-4, have demonstrated exceptional capabilities in
processing, translating,
and generating human languages. These breakthroughs have also been reflected in
protein research, leading to the rapid development of numerous new methods in a
short time, with unprecedented performance. Language models, in particular,
have seen widespread use in protein research, as they have been utilized to
embed proteins, generate novel ones, and predict tertiary structures. In this
book chapter, we provide an overview of the use of protein generative models,
reviewing 1) language models for the design of novel artificial proteins, 2)
works that use non-Transformer architectures, and 3) applications in directed
evolution approaches.
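As a concrete illustration of the "embed proteins" use mentioned above, here is a minimal sketch that mean-pools per-residue hidden states from a small, publicly released ESM-2 checkpoint into a single fixed-size protein embedding; the checkpoint name, toy sequence, and pooling choice are illustrative assumptions, not prescriptions from this chapter.

```python
# Minimal sketch: embed a protein with a pretrained pLM (ESM-2 via Hugging Face).
# Checkpoint, toy sequence, and mean-pooling are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 variant; larger ones exist

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    outputs = model(**inputs)

# Mean-pool per-residue hidden states into one fixed-size protein embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 320]) for this checkpoint
```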
Related papers
- A Comprehensive Review of Transformer-based language models for Protein Sequence Analysis and Design [0.9600277231719874]
The impact of Transformer-based language models has been unprecedented in Natural Language Processing (NLP). The success of such models has also led to their adoption in other fields, including bioinformatics. In this review, we have discussed and analysed a significant number of works pertaining to such applications.
arXiv Detail & Related papers (2025-07-18T04:20:33Z)
- Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation [3.2039076408339353]
We introduce Prot42, a novel family of Protein Language Models (pLMs) pretrained on vast amounts of unlabeled protein sequences.
Remarkably, our models handle sequences up to 8,192 amino acids, significantly surpassing standard limitations.
Prot42 excels in generating high-affinity protein binders and sequence-specific DNA-binding proteins.
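Prot42 checkpoints are not assumed here; as a hedged stand-in, the sketch below samples novel sequences from ProtGPT2, an earlier autoregressive pLM by one of the chapter's authors, using the sampling settings suggested in its Hugging Face model card.

```python
# Hedged stand-in for pLM sequence generation, using ProtGPT2 rather than Prot42.
from transformers import pipeline

generator = pipeline("text-generation", model="nferruz/ProtGPT2")

# ProtGPT2 was trained on FASTA-like text; generation starts from "<|endoftext|>".
samples = generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,               # sampling settings from the ProtGPT2 model card
    repetition_penalty=1.2,
    num_return_sequences=3,
    eos_token_id=0,
)
for s in samples:
    print(s["generated_text"])
```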
arXiv Detail & Related papers (2025-04-06T11:43:12Z)
- Nature Language Model: Deciphering the Language of Nature for Scientific Discovery [105.55751854768297]
Foundation models have revolutionized natural language processing and artificial intelligence.
We introduce Nature Language Model (NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z)
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
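ProtLLM's dynamic protein mounting is more involved than plain vocabulary extension, but the core "protein-as-word" idea can be approximated by registering protein identifiers as single tokens. The base model and UniProt-style IDs below are placeholders, not the paper's setup.

```python
# Simplified "protein-as-word" illustration: protein IDs become single tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")

protein_ids = ["<prot:P69905>", "<prot:P68871>"]         # hypothetical protein tokens
tokenizer.add_tokens(protein_ids)
model.resize_token_embeddings(len(tokenizer))            # make room for new tokens

text = "The oxygen carrier in blood is <prot:P69905>."
print(tokenizer.tokenize(text))  # the protein ID survives as a single token
```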
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
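PST weights are not assumed here; the sketch below only shows the generic evaluation pattern behind such function-prediction benchmarks: fit a linear probe on fixed protein embeddings (from any pLM) against function labels. The data is synthetic.

```python
# Generic linear-probe evaluation on precomputed protein embeddings (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 320))    # stand-in for 200 precomputed pLM embeddings
y = rng.integers(0, 2, size=200)   # stand-in binary function labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # ~0.5 on random data
```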
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Generative artificial intelligence for de novo protein design [1.2021565114959365]
Generative architectures seem adept at generating novel, yet realistic proteins.
Design protocols now achieve experimental success rates nearing 20%.
Despite extensive progress, there are clear field-wide challenges.
arXiv Detail & Related papers (2023-10-15T00:02:22Z)
- InstructProtein: Aligning Human and Protein Language via Knowledge Instruction [38.46621806898224]
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins.
We propose InstructProtein, which possesses bidirectional generation capabilities in both human and protein languages.
InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design.
arXiv Detail & Related papers (2023-10-05T02:45:39Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
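The paper's exact adapter is not reproduced here; the sketch below is one plausible reading of a "lightweight structural adapter": a residual bottleneck module that mixes per-residue structure features into frozen pLM hidden states. All dimensions are assumptions.

```python
# Assumed shape of a lightweight structural adapter (residual bottleneck).
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    def __init__(self, hidden_dim: int, struct_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim + struct_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # Residual update: structure features steer the frozen pLM hidden states.
        delta = self.up(self.act(self.down(torch.cat([h, struct_feats], dim=-1))))
        return h + delta

h = torch.randn(2, 128, 768)   # (batch, residues, pLM hidden size)
s = torch.randn(2, 128, 16)    # per-residue structure encoding
print(StructuralAdapter(768, 16)(h, s).shape)  # torch.Size([2, 128, 768])
```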
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Integration of Pre-trained Protein Language Models into Geometric Deep Learning Networks [68.90692290665648]
We integrate knowledge learned by protein language models into several state-of-the-art geometric networks.
Our findings show an overall improvement of 20% over baselines.
Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin.
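One simple way to realize this kind of integration, shown below as an assumption-laden sketch rather than the paper's exact recipe, is to concatenate frozen pLM residue embeddings with geometric node features before message passing.

```python
# Assumed fusion-by-concatenation of pLM embeddings and geometric node features.
import torch
import torch.nn as nn

class FusedNodeEncoder(nn.Module):
    def __init__(self, geom_dim: int, plm_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(geom_dim + plm_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, geom_feats: torch.Tensor, plm_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([geom_feats, plm_embeds], dim=-1))

encoder = FusedNodeEncoder(geom_dim=32, plm_dim=320, out_dim=128)
fused = encoder(torch.randn(50, 32), torch.randn(50, 320))
print(fused.shape)  # torch.Size([50, 128]); these node features feed any GNN
```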
arXiv Detail & Related papers (2022-12-07T04:04:04Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
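RAVEN is a full analysis suite; the toy function below captures only its core n-gram question: what fraction of generated n-grams never occur in the training data?

```python
# Toy n-gram novelty check in the spirit of RAVEN (not the actual suite).
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated: str, training: str, n: int = 4) -> float:
    gen = ngrams(generated.split(), n)
    train = ngrams(training.split(), n)
    return len(gen - train) / len(gen) if gen else 0.0

train_text = "the cat sat on the mat and the dog slept on the rug"
gen_text = "the cat sat on the rug and the dog ran away"
print(f"{novelty(gen_text, train_text):.2f}")  # share of novel 4-grams
```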
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Protein sequence-to-structure learning: Is this the end(-to-end revolution)? [0.8399688944263843]
In CASP14, deep learning boosted the field to unanticipated levels, reaching near-experimental accuracy.
Novel emerging approaches include geometric learning, i.e., learning on representations such as graphs, 3D Voronoi tessellations, and point clouds.
We provide an overview and our opinion of the novel deep learning approaches developed in the last two years and widely used in CASP14.
arXiv Detail & Related papers (2021-05-16T10:46:44Z)