Related papers: Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship

URL: http://arxiv.org/abs/2405.06663v1
Date: Mon, 29 Apr 2024 05:42:29 GMT
Title: Protein Representation Learning by Capturing Protein Sequence-Structure-Function Relationship
Authors: Eunji Ko, Seul Lee, Minseon Kim, Dongki Kim,
Abstract summary: AMMA adopts a unified multi-modal encoder to integrate all three modalities into a unified representation space. AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships.
Score: 12.11413472492417
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The goal of protein representation learning is to extract knowledge from protein databases that can be applied to various protein-related downstream tasks. Although protein sequence, structure, and function are the three key modalities for a comprehensive understanding of proteins, existing methods for protein representation learning have utilized only one or two of these modalities due to the difficulty of capturing the asymmetric interrelationships between them. To account for this asymmetry, we introduce our novel asymmetric multi-modal masked autoencoder (AMMA). AMMA adopts (1) a unified multi-modal encoder to integrate all three modalities into a unified representation space and (2) asymmetric decoders to ensure that sequence latent features reflect structural and functional information. The experiments demonstrate that the proposed AMMA is highly effective in learning protein representations that exhibit well-aligned inter-modal relationships, which in turn makes it effective for various downstream protein-related tasks.

Related papers

Bidirectional Hierarchical Protein Multi-Modal Representation Learning [4.682021474006426]
Protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features.
arXiv Detail & Related papers (2025-04-07T06:47:49Z)
ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description [7.198238666986253]
We propose a de novo fine-grained framework capable of designing proteins from any descriptive text input. Prot DAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than separate entities. Experimental results demonstrate that Prot DAT achieves the state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity.
arXiv Detail & Related papers (2024-12-05T11:05:46Z)
SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models. It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features. Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding [43.811432723460534]
We introduce Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach integrates a noval structure-aware module into pLMs to inform them with structural knowledge, and then connects these enhanced pLMs to large language models (LLMs) to generate understanding of proteins. We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
A PLMs based protein retrieval framework [4.110243520064533]
We propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework initiatively harnesses protein language models (PLMs) to embed protein sequences within a high-dimensional feature space. Extensive experiments demonstrate that our framework can equally retrieve both similar and dissimilar proteins.
arXiv Detail & Related papers (2024-07-16T09:52:42Z)
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding [88.43323947543996]
Language Models (LMs) excel in understanding textual descriptions of proteins. Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. We introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding.
arXiv Detail & Related papers (2024-05-21T08:06:13Z)
A Text-guided Protein Design Framework [106.79061950107922]
We propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
arXiv Detail & Related papers (2023-02-09T12:59:16Z)
Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information. We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures [18.961218808251076]
We propose two new learning operations enabling deep 3D analysis of large-scale protein data. First, we introduce a novel convolution operator which considers both, the intrinsic (invariant under protein folding) as well as extrinsic (invariant under bonding) structure. Second, we enable a multi-scale protein analysis by introducing hierarchical pooling operators, exploiting the fact that proteins are a recombination of a finite set of amino acids.
arXiv Detail & Related papers (2020-07-13T09:02:40Z)
BERTology Meets Biology: Interpreting Attention in Protein Language Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention. We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure. We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.