PRISM: Enhancing Protein Inverse Folding through Fine-Grained Retrieval on Structure-Sequence Multimodal Representations
- URL: http://arxiv.org/abs/2510.11750v1
- Date: Sun, 12 Oct 2025 00:45:22 GMT
- Title: PRISM: Enhancing Protein Inverse Folding through Fine-Grained Retrieval on Structure-Sequence Multimodal Representations
- Authors: Sazan Mahbub, Souvik Kundu, Eric P. Xing
- Abstract summary: We present PRISM, a multimodal retrieval-augmented generation framework for inverse folding. It retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM establishes new state of the art in both perplexity and amino acid recovery, while also improving foldability metrics.
- Score: 42.870409939729974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Designing protein sequences that fold into a target three-dimensional structure, known as the inverse folding problem, is central to protein engineering but remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates, yet they lack explicit mechanisms to reuse fine-grained structure-sequence patterns that are conserved across natural proteins. We present PRISM, a multimodal retrieval-augmented generation framework for inverse folding that retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability. Across five benchmarks (CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split), PRISM establishes new state of the art in both perplexity and amino acid recovery, while also improving foldability metrics (RMSD, TM-score, pLDDT), demonstrating that fine-grained multimodal retrieval is a powerful and efficient paradigm for protein sequence design.
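The abstract reports results in terms of perplexity and amino acid recovery. As a rough illustration of what these two metrics usually mean in inverse folding, here is a minimal sketch using the common textbook definitions: recovery as the fraction of positions where the designed residue matches the native one, and perplexity as the exponential of the mean negative log-likelihood assigned to the native residues. The function names, the toy sequences, and the uniform-model baseline are assumptions made for illustration; the paper's exact evaluation protocol (for example, how scores are aggregated across proteins) may differ.

```python
import math


def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def perplexity(native_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability of the native residues."""
    if not native_log_probs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(native_log_probs) / len(native_log_probs))


# Hypothetical toy sequences (not taken from any benchmark).
native = "MKTAYIAK"
designed = "MKSAYLAK"
recovery = amino_acid_recovery(designed, native)

# Assumed baseline: a model that spreads probability uniformly over 20 amino acids.
uniform_log_probs = [math.log(1 / 20)] * len(native)
ppl = perplexity(uniform_log_probs)

print(f"recovery={recovery:.2f}, perplexity={ppl:.2f}")
```

Under these definitions a uniform model over the 20 standard amino acids has perplexity of about 20, so lower perplexity and higher recovery both indicate that a model concentrates probability on the native residues.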
Related papers
- Protein Autoregressive Modeling via Multiscale Structure Generation [51.92004892768298]
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation. We adopt noisy context learning and scheduled sampling, enabling robust backbone generation. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality.
arXiv Detail & Related papers (2026-02-04T18:59:49Z)
- RIGA-Fold: A General Framework for Protein Inverse Folding via Recurrent Interaction and Geometric Awareness [14.42786271490985]
RIGA-Fold is a framework that synergizes Recurrent Interaction with Geometric Awareness. To bridge the gap between structural and sequence modalities, we introduce an enhanced variant, RIGA-Fold*. Our framework significantly outperforms state-of-the-art baselines in both sequence recovery and structural consistency.
arXiv Detail & Related papers (2026-02-04T15:07:13Z)
- ProteinAE: Protein Diffusion Autoencoders for Structure Encoding [64.77182442408254]
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
arXiv Detail & Related papers (2025-10-12T14:30:32Z)
- Multi-state Protein Design with DynamicMPNN [2.8456027933151993]
Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations.
arXiv Detail & Related papers (2025-07-29T15:51:26Z)
- Lattice Protein Folding with Variational Annealing [2.164205569823082]
We introduce a novel training scheme that employs masking to identify the lowest-energy folds in two-dimensional Hydrophobic-Polar (HP) lattice protein folding. Our findings emphasize the potential of advanced machine learning techniques in tackling complex protein folding problems.
arXiv Detail & Related papers (2025-02-28T01:30:15Z)
- Fast and Accurate Antibody Sequence Design via Structure Retrieval [32.38989928131971]
Igseek is a novel structure-retrieval framework that infers sequences from similar structures in a natural antibody database. Our experiments demonstrate that Igseek is not only highly efficient in structural retrieval but also outperforms state-of-the-art approaches in sequence recovery for both antibodies and T-Cell Receptors.
arXiv Detail & Related papers (2025-02-11T13:29:49Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Learning the Language of Protein Structure [8.364087723533537]
We introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures.
arXiv Detail & Related papers (2024-05-24T16:03:47Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.