PSP: Million-level Protein Sequence Dataset for Protein Structure
Prediction
- URL: http://arxiv.org/abs/2206.12240v1
- Date: Fri, 24 Jun 2022 14:08:44 GMT
- Title: PSP: Million-level Protein Sequence Dataset for Protein Structure
Prediction
- Authors: Sirui Liu, Jun Zhang, Haotian Chu, Min Wang, Boxin Xue, Ningxi Ni,
Jialiang Yu, Yuhao Xie, Zhenyu Chen, Mengyun Chen, Yuan Liu, Piya Patra, Fan
Xu, Jie Chen, Zidong Wang, Lijiang Yang, Fan Yu, Lei Chen, Yi Qin Gao
- Abstract summary: We present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP.
This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB).
We also provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset.
- Score: 34.11168458572554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proteins are essential components of human life, and their structures
are important for function and mechanism analysis. Recent work has shown the
potential of AI-driven methods for protein structure prediction. However, the
development of new models is restricted by the lack of datasets and benchmark
training procedures. To the best of our knowledge, the existing open-source
datasets fall far short of satisfying the needs of modern protein
sequence-structure research. To solve this problem, we present the first
million-level protein structure prediction dataset with high coverage and
diversity, named PSP. This dataset consists of 570k true structure sequences
(10TB) and 745k complementary distillation sequences (15TB). In addition, we
provide a benchmark training procedure for a SOTA protein structure prediction
model on this dataset. We validate the utility of this dataset for training by
participating in the CAMEO contest, in which our model won first place. We hope
our PSP dataset, together with the training benchmark, can enable a broader
community of AI/biology researchers to pursue AI-driven protein research.
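The "million-level" claim follows directly from the two subsets described in the abstract. A minimal sketch of that arithmetic, where the dict layout is an illustrative assumption (the dataset's actual on-disk format is not described here):

```python
# Composition of the PSP dataset as stated in the abstract.
# Counts and sizes come from the text; the dict structure is hypothetical.
psp_subsets = {
    "true_structure": {"sequences": 570_000, "size_tb": 10},
    "distillation":   {"sequences": 745_000, "size_tb": 15},
}

# Total sequence count: 570k + 745k = 1.315M, i.e. "million-level".
total_sequences = sum(s["sequences"] for s in psp_subsets.values())

# Total storage footprint: 10TB + 15TB = 25TB.
total_size_tb = sum(s["size_tb"] for s in psp_subsets.values())

print(total_sequences)  # 1315000
print(total_size_tb)    # 25
```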
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation [7.161099050722313]
We develop a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro).
CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes.
We utilize Foldseek to encode protein structures into "structure-sequences" and train a protein Structural Sequence Language Model, SSLM.
arXiv Detail & Related papers (2024-10-21T02:21:56Z) - NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z) - Structure-Informed Protein Language Model [38.019425619750265]
We introduce the integration of remote homology detection to distill structural information into protein language models.
We evaluate the impact of this structure-informed training on downstream protein function prediction tasks.
arXiv Detail & Related papers (2024-02-07T09:32:35Z) - Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z) - Efficiently Predicting Protein Stability Changes Upon Single-point
Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - Protein 3D Graph Structure Learning for Robust Structure-based Protein
Property Prediction [43.46012602267272]
Protein structure-based property prediction has emerged as a promising approach for various biological tasks.
Current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy.
Our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures.
arXiv Detail & Related papers (2023-10-14T08:43:42Z) - Data-Efficient Protein 3D Geometric Pretraining via Refinement of
Diffused Protein Structure Decoy [42.49977473599661]
Learning meaningful protein representation is important for a variety of biological downstream tasks such as structure-based drug design.
In this paper, we propose a unified framework for protein pretraining and a 3D geometric-based, data-efficient, and protein-specific pretext task: RefineDiff.
arXiv Detail & Related papers (2023-02-05T14:13:32Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - Transfer Learning for Protein Structure Classification at Low Resolution [124.5573289131546]
We show that it is possible to make accurate (≥80%) predictions of protein class and architecture from structures determined at low (≤3 Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.