Deep Learning Methods for Protein Family Classification on PDB Sequencing Data

URL: http://arxiv.org/abs/2207.06678v1
Date: Thu, 14 Jul 2022 06:11:32 GMT
Title: Deep Learning Methods for Protein Family Classification on PDB Sequencing Data
Authors: Aaron Wang
Abstract summary: We demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data. Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Composed of amino acid chains that influence how they fold and thus dictate their function and features, proteins are a class of macromolecules that play a central role in major biological processes and are required for the structure, function, and regulation of the body's tissues. Understanding protein functions is vital to the development of therapeutics and precision medicine, and hence the ability to classify proteins and their functions based on measurable features is crucial; indeed, the automatic inference of a protein's properties from its sequence of amino acids, known as its primary structure, remains an important open problem within the field of bioinformatics, especially given the recent advancements in sequencing technologies and the extensive number of known but uncategorized proteins with unknown properties. In this work, we demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data from the Protein Data Bank (PDB) of the Research Collaboratory for Structural Bioinformatics (RCSB), and benchmark this performance against classical machine learning approaches, including k-nearest neighbors and multinomial regression classifiers, trained on experimental data. Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.

Related papers

Aligning Proteins and Language: A Foundation Model for Protein Retrieval [30.32156711268032]
This paper aims to retrieve proteins with similar structures and semantics from large-scale protein datasets. Motivated by the recent progress of vision-caption models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations.
arXiv Detail & Related papers (2025-05-27T08:13:08Z)
Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. Recently, Language Models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing & generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z)
A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design [3.5897534810405403]
Deep learning has demonstrated remarkable performance in fields such as computer vision and natural language processing. It has been increasingly applied in recent years to the data-rich domain of protein sequences with great success. The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics.
arXiv Detail & Related papers (2025-01-02T05:21:34Z)
SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models. It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation [7.161099050722313]
We develop a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro). CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes. We utilize Foldseek to encode protein structures into "structure-sequences" and train a protein Structural Sequence Language Model, SSLM.
arXiv Detail & Related papers (2024-10-21T02:21:56Z)
Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a novel structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance? [4.7077642423577775]
We propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation. Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains.
arXiv Detail & Related papers (2024-06-28T08:54:37Z)
NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks. Specifically, we leverage molecular biophysical, secondary structure, chemical bond, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs). We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness. Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
Learning multi-scale functional representations of proteins from single-cell microscopy data [77.34726150561087]
We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information. We also propose a robust evaluation strategy to assess quality of protein representations across different scales of biological function.
arXiv Detail & Related papers (2022-05-24T00:00:07Z)
Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information. We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
Binary classification of proteins by a Machine Learning approach [0.0]
We present a system capable of classifying protein chains of amino acids based on the protein description contained in the Protein Data Bank. Each protein is fully described in its chemical-physical-geometric properties in a file in XML format. The aim of the work is to design a Deep Learning machinery for the collection and management of a huge amount of data and to validate it through its application to the classification of sequences of amino acids.
arXiv Detail & Related papers (2021-11-03T01:58:16Z)
PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction [0.07340017786387766]
In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank. We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis.
arXiv Detail & Related papers (2020-10-30T02:24:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.