PDBench: Evaluating Computational Methods for Protein Sequence Design
- URL: http://arxiv.org/abs/2109.07925v2
- Date: Fri, 17 Sep 2021 09:23:31 GMT
- Title: PDBench: Evaluating Computational Methods for Protein Sequence Design
- Authors: Leonardo V. Castorina, Rokas Petrenas, Kartic Subr and Christopher W. Wood
- Abstract summary: We present a benchmark set of proteins and propose tests to assess the performance of deep-learning-based methods.
Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility.
- Score: 2.0187324832551385
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Proteins perform critical processes in all living systems: converting solar
energy into chemical energy, replicating DNA, forming the basis of high-performance
materials, sensing and much more. While an incredible range of functionality
has been sampled in nature, it accounts for a tiny fraction of the possible
protein universe. If we could tap into this pool of unexplored protein
structures, we could search for novel proteins with useful properties that we
could apply to tackle the environmental and medical challenges facing humanity.
This is the purpose of protein design.
Sequence design is an important aspect of protein design, and many successful
methods to do this have been developed. Recently, deep-learning methods that
frame it as a classification problem have emerged as a powerful approach.
Beyond their reported improvement in performance, their primary advantage over
physics-based methods is that the computational burden is shifted from the user
to the developers, thereby increasing accessibility to the design method.
Despite this trend, the tools for assessment and comparison of such models
remain quite generic. The goal of this paper is both to address the timely
problem of evaluation and to shine a spotlight, within the Machine Learning
community, on specific assessment criteria that will accelerate impact.
We present a carefully curated benchmark set of proteins and propose a number
of standard tests to assess the performance of deep-learning-based methods. Our
robust benchmark provides biological insight into the behaviour of design
methods, which is essential for evaluating their performance and utility. We
compare five existing models with two novel models for sequence prediction.
Finally, we test the designs produced by these models with AlphaFold2, a
state-of-the-art structure-prediction algorithm, to determine if they are
likely to fold into the intended 3D shapes.
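As a rough illustration of what framing sequence design as a per-residue classification problem implies for evaluation, the sketch below scores a designed sequence against the native one. The function names and the choice of metrics (sequence recovery and macro-averaged recall over the 20 standard amino acids) are assumptions made for illustration; they are not taken from the paper's code and may differ from the tests the benchmark actually defines.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; each is one class label.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


def macro_recall(native: str, designed: str) -> float:
    """Per-amino-acid recall, averaged over the classes present in the native sequence.

    Unlike plain recovery, this weights rare and common residues equally.
    """
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    totals = Counter(native)
    hits = Counter(n for n, d in zip(native, designed) if n == d)
    recalls = [hits[aa] / totals[aa] for aa in AMINO_ACIDS if totals[aa] > 0]
    if not recalls:
        raise ValueError("native sequence contains no standard amino acids")
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    native, designed = "ACDA", "ACEA"  # hypothetical toy sequences
    print(sequence_recovery(native, designed))  # 0.75
    print(macro_recall(native, designed))
```

In the toy example, three of four positions match, but the single aspartate (D) is missed entirely, so the macro-averaged recall is lower than the raw recovery. This is one reason a class-balanced metric can be more informative than accuracy alone.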
Related papers
- ProteinBench: A Holistic Evaluation of Protein Foundation Models [53.59325047872512]
We introduce ProteinBench, a holistic evaluation framework for protein foundation models.
Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance.
arXiv Detail & Related papers (2024-09-10T06:52:33Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design [19.324059406159325]
We introduce two novel metrics: a refoldability-based metric and a stability-based metric.
ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short.
Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods.
arXiv Detail & Related papers (2023-11-30T02:37:55Z)
- Protein Sequence Design with Batch Bayesian Optimisation [0.0]
Protein sequence design is a challenging problem in protein engineering, which aims to discover novel proteins with useful biological functions.
Directed evolution is a widely used approach for protein sequence design, which mimics the evolution cycle in a laboratory environment and conducts an iterative protocol.
We propose a new method based on Batch Bayesian Optimization (Batch BO), a well-established optimization method, for protein sequence design.
arXiv Detail & Related papers (2023-03-18T14:53:20Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Deep Learning Methods for Protein Family Classification on PDB Sequencing Data [0.0]
We demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data.
Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.
arXiv Detail & Related papers (2022-07-14T06:11:32Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Protein Representation Learning by Geometric Structure Pretraining [27.723095456631906]
Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences.
We first present a simple yet effective encoder to learn protein geometry features.
Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods using much less data.
arXiv Detail & Related papers (2022-03-11T17:52:13Z)
- Protein model quality assessment using rotation-equivariant, hierarchical neural networks [8.373439916313018]
We present a novel deep learning approach to assess the quality of a protein model.
Our method achieves state-of-the-art results in scoring protein models submitted to recent rounds of CASP.
arXiv Detail & Related papers (2020-11-27T05:03:53Z)
- Energy-based models for atomic-resolution protein conformations [88.68597850243138]
We propose an energy-based model (EBM) of protein conformations that operates at atomic scale.
The model is trained solely on crystallized protein data.
An investigation of the model's outputs and hidden representations finds that it captures physicochemical properties relevant to protein energy.
arXiv Detail & Related papers (2020-04-27T20:45:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.