Open-Source Protein Language Models for Function Prediction and Protein Design
- URL: http://arxiv.org/abs/2412.13519v1
- Date: Wed, 18 Dec 2024 05:41:15 GMT
- Title: Open-Source Protein Language Models for Function Prediction and Protein Design
- Authors: Shivasankaran Vanaja Pandi, Bharath Ramsundar
- Abstract summary: Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering.
We integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks.
We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks.
- Abstract: Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering. However, training these models from scratch requires significant computational resources, limiting their accessibility. To address this, we integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks. We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks. Additionally, we present an exploration of generating plastic-degrading enzyme candidates using the model's embeddings and latent space manipulation techniques. While the results suggest that further refinement is needed, this approach provides a foundation for future work in enzyme design. This study aims to facilitate the use of PLMs in research fields like synthetic biology and environmental sustainability, even for those with limited computational resources.
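For readers who want to try the workflow the abstract outlines, the sketch below shows one plausible way to embed protein sequences with a pretrained PLM and then interpolate between embeddings in latent space. It is a minimal illustration, not the paper's code: it assumes the HuggingFace transformers package and the public Rostlab/prot_bert checkpoint, whereas the paper accesses its PLM through the DeepChem integration, and the two sequences are hypothetical placeholders.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Load a publicly available protein language model (assumption: ProtBERT).
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Return a mean-pooled embedding for a single protein sequence."""
    # ProtBERT expects space-separated residues, with rare residues mapped to X.
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 1024)
    return hidden.mean(dim=1).squeeze(0)            # (1024,)

# Latent-space manipulation: linearly interpolate between the embeddings of
# two (hypothetical) enzyme sequences; a decoder or nearest-neighbour search
# over known sequences would map the interpolants back to candidate proteins.
z_a = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
z_b = embed("MKVLWAALLVTFLAGCQAKVEQAVETEPEPELR")
candidates = [(1 - t) * z_a + t * z_b for t in (0.25, 0.5, 0.75)]
```

Mean pooling and linear interpolation are the simplest choices here; the paper's latent-space manipulation may differ in both the pooling and the generation strategy.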
Related papers
- Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm.
Recently, protein Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capabilities.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
- Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting [0.0]
HOPER (HOlistic ProtEin Representation) is a novel framework designed to enhance protein function prediction (PFP) in low-data settings.
Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research.
arXiv Detail & Related papers (2024-11-22T20:13:55Z)
- OneProt: Towards Multi-Modal Protein Foundation Models [5.440531199006399]
We introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data.
It surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction.
This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
arXiv Detail & Related papers (2024-11-07T16:54:54Z)
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- ProteinEngine: Empower LLM with Domain Knowledge for Protein Engineering [5.474946062328154]
ProteinEngine is a human-centered platform aimed at amplifying the capabilities of large language models in protein engineering.
Uniquely, ProteinEngine assigns three distinct roles to LLMs, facilitating efficient task delegation, specialized task resolution, and effective communication of results.
Our findings highlight the potential of ProteinEngine to bridge the disconnected tools for future research in the protein engineering domain.
arXiv Detail & Related papers (2024-04-21T01:07:33Z)
- MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding [82.31506767274841]
Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities.
MAPE-PPI encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary".
MAPE-PPI can scale to prediction over millions of PPIs with superior trade-offs between effectiveness and computational efficiency.
arXiv Detail & Related papers (2024-02-22T09:04:41Z)
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- PQA: Zero-shot Protein Question Answering for Free-form Scientific Enquiry with Large Language Models [4.5044944051958264]
Protein Question Answering (PQA) is a task designed to answer a wide range of protein-related queries without task-specific training.
Pika, the accompanying framework, comprises a curated, debiased dataset tailored for PQA and a biochemically relevant benchmarking strategy.
arXiv Detail & Related papers (2024-02-21T09:38:17Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.