Aligning Proteins and Language: A Foundation Model for Protein Retrieval
- URL: http://arxiv.org/abs/2506.08023v1
- Date: Tue, 27 May 2025 08:13:08 GMT
- Title: Aligning Proteins and Language: A Foundation Model for Protein Retrieval
- Authors: Qifeng Wu, Zhengzhe Liu, Han Zhu, Yizhou Zhao, Daisuke Kihara, Min Xu,
- Abstract summary: This paper aims to retrieve proteins with similar structures and semantics from large-scale protein dataset.<n>Motivated by the recent progress of vision-caption models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations.
- Score: 30.32156711268032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to retrieve proteins with similar structures and semantics from large-scale protein dataset, facilitating the functional interpretation of protein structures derived by structural determination methods like cryo-Electron Microscopy (cryo-EM). Motivated by the recent progress of vision-language models (VLMs), we propose a CLIP-style framework for aligning 3D protein structures with functional annotations using contrastive learning. For model training, we propose a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model in both in-domain and more challenging cross-database retrieval on Protein Data Bank (PDB) and Electron Microscopy Data Bank (EMDB) dataset, respectively. In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.
Related papers
- Protein Large Language Models: A Comprehensive Survey [71.65899614084853]
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design.<n>This work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications.
arXiv Detail & Related papers (2025-02-21T19:22:10Z) - SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation [7.161099050722313]
We develop a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro)
CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes.
We utilize Foldseek to encode protein structures into "structure-sequences" and trained a protein Structural Sequence Language Model, SSLM.
arXiv Detail & Related papers (2024-10-21T02:21:56Z) - Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z) - Functional Geometry Guided Protein Sequence and Backbone Structure
Co-Design [12.585697288315846]
We propose a model to jointly design Protein sequence and structure based on automatically detected functional sites.
NAEPro is powered by an interleaving network of attention and equivariant layers, which can capture global correlation in a whole sequence.
Experimental results show that our model consistently achieves the highest amino acid recovery rate, TM-score, and the lowest RMSD among all competitors.
arXiv Detail & Related papers (2023-10-06T16:08:41Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs)
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z) - PSP: Million-level Protein Sequence Dataset for Protein Structure
Prediction [34.11168458572554]
We present the first million-level protein structure prediction dataset with high coverage and diversity, named as PSP.
This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB)
We provide in addition the benchmark training procedure for SOTA protein structure prediction model on this dataset.
arXiv Detail & Related papers (2022-06-24T14:08:44Z) - Learning Geometrically Disentangled Representations of Protein Folding
Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - PersGNN: Applying Topological Data Analysis and Geometric Deep Learning
to Structure-Based Protein Function Prediction [0.07340017786387766]
In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank.
We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis.
arXiv Detail & Related papers (2020-10-30T02:24:35Z) - BERTology Meets Biology: Interpreting Attention in Protein Language
Models [124.8966298974842]
We demonstrate methods for analyzing protein Transformer models through the lens of attention.
We show that attention captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure.
We also present a three-dimensional visualization of the interaction between attention and protein structure.
arXiv Detail & Related papers (2020-06-26T21:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.