Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship
- URL: http://arxiv.org/abs/2405.06663v1
- Date: Mon, 29 Apr 2024 05:42:29 GMT
- Title: Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship
- Authors: Eunji Ko, Seul Lee, Minseon Kim, Dongki Kim
- Abstract summary: AMMA adopts a unified multi-modal encoder to integrate all three modalities into a unified representation space.
AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships.
- Score: 12.11413472492417
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The goal of protein representation learning is to extract knowledge from protein databases that can be applied to various protein-related downstream tasks. Although protein sequence, structure, and function are the three key modalities for a comprehensive understanding of proteins, existing methods for protein representation learning have utilized only one or two of these modalities due to the difficulty of capturing the asymmetric interrelationships between them. To account for this asymmetry, we introduce our novel asymmetric multi-modal masked autoencoder (AMMA). AMMA adopts (1) a unified multi-modal encoder to integrate all three modalities into a unified representation space and (2) asymmetric decoders to ensure that sequence latent features reflect structural and functional information. The experiments demonstrate that the proposed AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships, which in turn makes it effective for various downstream protein-related tasks.
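The abstract's two ingredients, a unified multi-modal encoder and asymmetric decoders that reconstruct structure and function from sequence latents, can be sketched roughly as follows. This is a toy numpy illustration only: the layer sizes, linear maps, and mean-pooling here are hypothetical stand-ins for the paper's Transformer-based architecture.

```python
# Toy sketch of an asymmetric multi-modal masked autoencoder in the spirit
# of AMMA. Shapes and linear layers are illustrative assumptions, not the
# paper's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
D_SEQ, D_STRUCT, D_FUNC, D_LATENT = 16, 12, 8, 10
L = 6  # number of sequence tokens

# One projection per modality into a unified representation space.
W_seq = rng.normal(size=(D_SEQ, D_LATENT))
W_struct = rng.normal(size=(D_STRUCT, D_LATENT))
W_func = rng.normal(size=(D_FUNC, D_LATENT))

def unified_encoder(seq, struct, func, mask):
    """Project all three modalities into one space; masked sequence tokens are zeroed."""
    z_seq = (seq @ W_seq) * mask[:, None]      # (L, D_LATENT), with masking
    z_struct = struct @ W_struct               # (1, D_LATENT)
    z_func = func @ W_func                     # (1, D_LATENT)
    return np.concatenate([z_seq, z_struct, z_func], axis=0)

# Asymmetric decoders: structure and function are reconstructed from the
# *sequence* latents only, pushing those latents to carry cross-modal info.
W_dec_struct = rng.normal(size=(D_LATENT, D_STRUCT))
W_dec_func = rng.normal(size=(D_LATENT, D_FUNC))

def decode_from_sequence(z):
    seq_latents = z[:L]                        # keep only sequence tokens
    pooled = seq_latents.mean(axis=0)          # (D_LATENT,)
    return pooled @ W_dec_struct, pooled @ W_dec_func

seq = rng.normal(size=(L, D_SEQ))
struct = rng.normal(size=(1, D_STRUCT))
func = rng.normal(size=(1, D_FUNC))
mask = (rng.random(L) > 0.5).astype(float)     # random token masking

z = unified_encoder(seq, struct, func, mask)
struct_hat, func_hat = decode_from_sequence(z)
print(z.shape, struct_hat.shape, func_hat.shape)  # (8, 10) (12,) (8,)
```

The asymmetry lives in `decode_from_sequence`: reconstruction targets for structure and function are predicted from sequence latents alone, rather than each modality decoding itself.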
Related papers
- ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins.
Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts.
We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- Protein-ligand binding representation learning from fine-grained interactions [29.965890962846093]
We propose to learn protein-ligand binding representation in a self-supervised learning manner.
This self-supervised learning problem is formulated as predicting the conclusive binding complex structure.
Experiments have demonstrated the superiority of our method across various binding tasks.
arXiv Detail & Related papers (2023-11-09T01:33:09Z)
- A Text-guided Protein Design Framework [109.18157766856196]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design.
ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation.
We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 10 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
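The first ProteinDT step, ProteinCLAP, aligns text and protein representations. A common way to do this is a symmetric contrastive (InfoNCE) objective; the sketch below is a hypothetical numpy version with illustrative names and dimensions, not ProteinDT's actual implementation.

```python
# Hypothetical sketch of a CLAP-style contrastive alignment objective:
# matched (text, protein) embedding pairs are pulled together, mismatched
# pairs pushed apart. All names and sizes here are assumptions.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clap_loss(text_emb, prot_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (text, protein) pairs."""
    t = l2_normalize(text_emb)
    p = l2_normalize(prot_emb)
    logits = t @ p.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(t))                # matched pairs on the diagonal
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)        # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (xent(logits) + xent(logits.T))      # text->prot + prot->text

rng = np.random.default_rng(1)
B, D = 4, 32
shared = rng.normal(size=(B, D))
text = shared + 0.01 * rng.normal(size=(B, D))   # nearly aligned pairs
prot = shared + 0.01 * rng.normal(size=(B, D))
aligned = clap_loss(text, prot)
random_pairs = clap_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)))
print(aligned, random_pairs)
```

Well-aligned pairs produce a much lower loss than unrelated embeddings, which is exactly the gradient signal that draws the two modalities into a shared space.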
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
- Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
- Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures [18.961218808251076]
We propose two new learning operations enabling deep 3D analysis of large-scale protein data.
First, we introduce a novel convolution operator that considers both the intrinsic structure (invariant under protein folding) and the extrinsic structure (invariant under bonding).
Second, we enable a multi-scale protein analysis by introducing hierarchical pooling operators, exploiting the fact that proteins are a recombination of a finite set of amino acids.
arXiv Detail & Related papers (2020-07-13T09:02:40Z)
- BERTology Meets Biology: Interpreting Attention in Protein Language Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
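The BERTology-style analysis above, checking whether high-attention residue pairs that are far apart in sequence are close in 3D, can be sketched with synthetic data. Everything below (coordinates, the 8 Å contact threshold, the attention matrix) is fabricated purely to illustrate the measurement, not data or code from the paper.

```python
# Toy sketch: measure how often a residue's strongest long-range attention
# partners are spatial contacts. Coordinates and attention are synthetic.
import numpy as np

rng = np.random.default_rng(2)
N = 10
coords = rng.normal(size=(N, 3)) * 5.0         # toy 3D residue coordinates

# Pairwise Euclidean distances and a simple 8 Angstrom contact map.
dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
contacts = (dists < 8.0) & ~np.eye(N, dtype=bool)

# Build attention that deliberately attends to spatial contacts, plus noise.
attn = contacts.astype(float) + 0.1 * rng.random((N, N))
attn = attn / attn.sum(axis=1, keepdims=True)  # row-normalize like softmax

def contact_precision(attn, contacts, seq_sep=3, top_k=2):
    """Of each residue's top-k attended long-range partners, the fraction that are contacts."""
    hits, total = 0, 0
    for i in range(attn.shape[0]):
        far = [j for j in range(attn.shape[0]) if abs(i - j) >= seq_sep]
        top = sorted(far, key=lambda j: -attn[i, j])[:top_k]
        hits += sum(contacts[i, j] for j in top)
        total += len(top)
    return hits / total

print(contact_precision(attn, contacts))
```

On real models, running this kind of precision metric per attention head is what reveals which heads track three-dimensional structure rather than sequence-local context.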
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.