OpenProteinSet: Training data for structural biology at scale
- URL: http://arxiv.org/abs/2308.05326v1
- Date: Thu, 10 Aug 2023 04:01:04 GMT
- Title: OpenProteinSet: Training data for structural biology at scale
- Authors: Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel
Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed
AlQuraishi
- Abstract summary: Multiple sequence alignments (MSAs) of proteins encode rich biological information.
Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance.
OpenProteinSet is an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple sequence alignments (MSAs) of proteins encode rich biological
information and have been workhorses in bioinformatic methods for tasks like
protein design and protein structure prediction for decades. Recent
breakthroughs like AlphaFold2 that use transformers to attend directly over
large quantities of raw MSAs have reaffirmed their importance. Generation of
MSAs is highly computationally intensive, however, and no datasets comparable
to those used to train AlphaFold2 have been made available to the research
community, hindering progress in machine learning for proteins. To remedy this
problem, we introduce OpenProteinSet, an open-source corpus of more than 16
million MSAs, associated structural homologs from the Protein Data Bank, and
AlphaFold2 protein structure predictions. We have previously demonstrated the
utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We
expect OpenProteinSet to be broadly useful as training and validation data for
1) diverse tasks focused on protein structure, function, and design and 2)
large-scale multimodal machine learning research.
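Since the corpus is built from MSAs, a reader may want a picture of what consuming one such alignment looks like. Below is a minimal sketch of parsing a single MSA in A3M format (the format produced by tools like HHblits; whether a given OpenProteinSet file uses A3M or another alignment format is an assumption here, and `parse_a3m` is an illustrative helper, not part of any released tooling). In A3M, lowercase letters mark insertions relative to the query and can be stripped to recover a rectangular alignment.

```python
def parse_a3m(text: str) -> list[tuple[str, str]]:
    """Return (header, aligned_sequence) pairs, dropping A3M insertion columns."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip():
            # Lowercase letters are insertions relative to the query;
            # removing them leaves every row the same length as the query.
            chunks.append("".join(c for c in line.strip() if not c.islower()))
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

# Toy two-sequence alignment: the homolog has one insertion ('a') and one gap.
example = """>query
MKVLT
>homolog_1
MKVa-T
"""
msa = parse_a3m(example)
# msa == [("query", "MKVLT"), ("homolog_1", "MKV-T")]
```

After stripping insertions, every row has the query's length, so the result can be stacked directly into the residue-by-sequence matrix that MSA-attending models such as AlphaFold2 operate on.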
Related papers
- MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics [0.2796197251957245]
We introduce APACE, AlphaFold2 and advanced computing as a service.
APACE is up to two orders of magnitude faster than off-the-shelf AlphaFold2 implementations.
This computational approach may be readily linked with robotics laboratories to automate and accelerate scientific discovery.
arXiv Detail & Related papers (2023-08-15T18:00:01Z)
- Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation [30.2874172276931]
We introduce MSA-Augmenter, which generates useful, novel protein sequences not currently found in databases.
Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain co-evolutionary information from inferior MSAs.
arXiv Detail & Related papers (2023-06-02T14:13:50Z)
- Retrieved Sequence Augmentation for Protein Representation Learning [40.13920287967866]
We introduce Retrieved Sequence Augmentation for protein representation learning without additional alignment or pre-processing.
We show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction.
Our study fills a frequently encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences.
arXiv Detail & Related papers (2023-02-24T10:31:45Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs: a lightweight structural adapter is implanted into them, endowing them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative [61.984700682903096]
HelixFold-Single is proposed to combine a large-scale protein language model with the superior geometric learning capability of AlphaFold2.
Our proposed method pre-trains a large-scale protein language model on billions of primary sequences.
We obtain an end-to-end differentiable model to predict the 3D coordinates of atoms from only the primary sequence.
arXiv Detail & Related papers (2022-07-28T07:30:33Z) - PSP: Million-level Protein Sequence Dataset for Protein Structure
Prediction [34.11168458572554]
We present the first million-level protein structure prediction dataset with high coverage and diversity, named as PSP.
This dataset consists of 570k true structure sequences (10 TB) and 745k complementary distillation sequences (15 TB).
We provide in addition the benchmark training procedure for SOTA protein structure prediction model on this dataset.
arXiv Detail & Related papers (2022-06-24T14:08:44Z) - Learning Geometrically Disentangled Representations of Protein Folding
Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - Protein-RNA interaction prediction with deep learning: Structure matters [19.541738343743592]
Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study the interactions.
Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, the protein-RNA interaction prediction will also be promoted significantly in the upcoming years.
This survey summarizes the development of the RBP-RNA interaction field in the past and foresees its future development in the post-AlphaFold era.
arXiv Detail & Related papers (2021-07-26T14:43:36Z) - BERTology Meets Biology: Interpreting Attention in Protein Language
Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.