PLM-eXplain: Divide and Conquer the Protein Embedding Space
- URL: http://arxiv.org/abs/2504.07156v1
- Date: Wed, 09 Apr 2025 10:46:24 GMT
- Title: PLM-eXplain: Divide and Conquer the Protein Embedding Space
- Authors: Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln
- Abstract summary: We present an explainable adapter layer, PLM-eXplain (PLM-X). PLM-X bridges the gap by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. We demonstrate the effectiveness of our approach across three protein-level classification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein language models (PLMs) have revolutionised computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black-box nature limits biological interpretation and translation to actionable insights. We present an explainable adapter layer, PLM-eXplain (PLM-X), that bridges this gap by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. Using embeddings from ESM2, our adapter incorporates well-established properties, including secondary structure and hydropathy, while maintaining high performance. We demonstrate the effectiveness of our approach across three protein-level classification tasks: prediction of extracellular vesicle association, identification of transmembrane helices, and prediction of aggregation propensity. PLM-X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalisable solution for enhancing PLM interpretability across various downstream applications. This work addresses a critical need in computational biology by providing a bridge between powerful deep learning models and actionable biological insights.
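The two-component factorisation described in the abstract maps naturally onto a small adapter network. The PyTorch sketch below illustrates the general idea only, not the authors' released code: a hypothetical adapter projects a (e.g., mean-pooled ESM2) embedding into an interpretable subspace supervised against known biochemical features such as secondary structure and hydropathy, plus a free residual subspace, and the downstream classifier consumes both. All module names, dimensions, and the loss weighting are illustrative assumptions.

```python
# Hypothetical sketch of the PLM-X idea: split a PLM embedding into an
# interpretable part (supervised against known biochemical features) and a
# free residual part, then classify on their concatenation.
# Names and dimensions are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class PLMXAdapter(nn.Module):
    def __init__(self, plm_dim=1280, interp_dim=32, resid_dim=96,
                 n_ss_classes=3, n_task_classes=2):
        super().__init__()
        self.to_interp = nn.Linear(plm_dim, interp_dim)  # interpretable subspace
        self.to_resid = nn.Linear(plm_dim, resid_dim)    # residual subspace
        # Heads that anchor the interpretable subspace to biochemical features
        self.ss_head = nn.Linear(interp_dim, n_ss_classes)  # secondary structure
        self.hydro_head = nn.Linear(interp_dim, 1)          # hydropathy (scalar)
        # The task head sees both subspaces, so predictive power is preserved
        self.task_head = nn.Linear(interp_dim + resid_dim, n_task_classes)

    def forward(self, emb):
        z_int = self.to_interp(emb)
        z_res = self.to_resid(emb)
        return {
            "ss_logits": self.ss_head(z_int),
            "hydropathy": self.hydro_head(z_int).squeeze(-1),
            "task_logits": self.task_head(torch.cat([z_int, z_res], dim=-1)),
        }

# Training would combine a task loss with auxiliary losses that tie the
# interpretable subspace to ground-truth biochemical annotations:
adapter = PLMXAdapter()
emb = torch.randn(8, 1280)  # e.g. mean-pooled ESM2 (650M) embeddings
out = adapter(emb)
loss = (nn.functional.cross_entropy(out["task_logits"], torch.randint(0, 2, (8,)))
        + nn.functional.cross_entropy(out["ss_logits"], torch.randint(0, 3, (8,)))
        + nn.functional.mse_loss(out["hydropathy"], torch.randn(8)))
loss.backward()
```

Because the classifier consumes both subspaces, interpretation comes from inspecting how the task head weights the biochemically grounded coordinates, rather than from constraining the classifier itself.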
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset. This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks. We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders [24.150250149027883]
Protein language models (PLMs) have demonstrated remarkable success in protein modeling and design. We present a systematic approach to extract and analyze interpretable features from PLMs using sparse autoencoders. As practical applications, we demonstrate how these latent features can fill in missing annotations in protein databases.
arXiv Detail & Related papers (2024-11-13T18:51:21Z)
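For readers unfamiliar with the sparse-autoencoder technique the InterPLM entry relies on, here is a minimal generic sketch: an overcomplete latent with an L1 sparsity penalty encourages individual latent units to align with human-interpretable features. The latent width and L1 coefficient are illustrative assumptions, not InterPLM's configuration.

```python
# Generic sparse autoencoder over per-residue PLM embeddings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=1280, d_latent=8192):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # non-negative, sparse activations
        return self.dec(z), z

sae = SparseAutoencoder()
x = torch.randn(64, 1280)            # batch of PLM embeddings
x_hat, z = sae(x)
l1_coef = 1e-3                       # sparsity strength (assumed value)
loss = nn.functional.mse_loss(x_hat, x) + l1_coef * z.abs().mean()
loss.backward()
```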
- Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Diffusion Language Models Are Versatile Protein Learners [75.98083311705182]
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework.
After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation.
arXiv Detail & Related papers (2024-02-28T18:57:56Z)
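As a rough illustration of the discrete diffusion framework the DPLM entry mentions, the sketch below runs one simplified absorbing-state (masking) diffusion training step: a random fraction of residues is replaced with a mask token and a denoiser learns to restore them. The toy denoiser, vocabulary size, and noise schedule are assumptions, not DPLM's actual design.

```python
# Simplified absorbing-state discrete diffusion training step over protein tokens.
import torch
import torch.nn as nn

VOCAB = 33                    # amino acids + special tokens (assumed size)
MASK_ID = VOCAB - 1

embed = nn.Embedding(VOCAB, 128)
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True),
    num_layers=2,
)
to_logits = nn.Linear(128, VOCAB)

seqs = torch.randint(0, VOCAB - 1, (4, 64))    # toy protein token batch
t = torch.rand(4, 1).clamp(0.1, 1.0)           # per-sequence noise level
corrupt = torch.rand(4, 64) < t                # mask roughly a t-fraction
noisy = torch.where(corrupt, torch.full_like(seqs, MASK_ID), seqs)

logits = to_logits(denoiser(embed(noisy)))
# Reconstruction loss only on corrupted positions, standard for
# absorbing-state discrete diffusion
loss = nn.functional.cross_entropy(logits[corrupt], seqs[corrupt])
loss.backward()
```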
- Molecule Design by Latent Prompt Transformer [76.2112075557233]
This work explores the challenging problem of molecule design by framing it as a conditional generative modeling task.
We propose a novel generative model comprising three components: (1) a latent vector with a learnable prior distribution; (2) a molecule generation model based on a causal Transformer, which uses the latent vector as a prompt; and (3) a property prediction model that predicts a molecule's target properties and/or constraint values using the latent prompt.
arXiv Detail & Related papers (2024-02-27T03:33:23Z)
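The three components enumerated in the entry above can be sketched schematically as follows; the diagonal-Gaussian prior, prompt length, and all sizes are illustrative assumptions rather than the paper's implementation.

```python
# Schematic of the three components: (1) a latent with a learnable prior,
# (2) a causal Transformer reading the latent as a prompt prefix,
# (3) a property predictor conditioned on the latent.
import torch
import torch.nn as nn

class LatentPromptModel(nn.Module):
    def __init__(self, vocab=64, d=128, z_dim=16, n_prompt=4):
        super().__init__()
        self.z_dim, self.n_prompt = z_dim, n_prompt
        # (1) learnable prior over the latent (diagonal Gaussian assumed)
        self.prior_mu = nn.Parameter(torch.zeros(z_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(z_dim))
        # (2) causal Transformer; the latent is mapped to prompt tokens
        self.z_to_prompt = nn.Linear(z_dim, n_prompt * d)
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d, vocab)
        # (3) property/constraint predictor conditioned on the latent
        self.prop_head = nn.Linear(z_dim, 1)

    def forward(self, tokens):
        B = tokens.size(0)
        # sample z from the learnable prior (reparameterised)
        z = self.prior_mu + self.prior_logvar.mul(0.5).exp() * torch.randn(B, self.z_dim)
        prompt = self.z_to_prompt(z).view(B, self.n_prompt, -1)
        h = torch.cat([prompt, self.embed(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        h = self.decoder(h, mask=mask)                # causal self-attention
        logits = self.lm_head(h[:, self.n_prompt:])   # token logits after prompt
        return logits, self.prop_head(z)

model = LatentPromptModel()
logits, prop = model(torch.randint(0, 64, (2, 32)))   # e.g. tokenised SMILES
```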
- X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design [0.0]
We report a mixture-of-experts strategy for creating fine-tuned large language models using a deep layer-wise, token-level approach based on low-rank adaptation (LoRA).
The design is inspired by the biological principles of universality and diversity, where neural network building blocks are reused in different hierarchical manifestations.
We develop a tailored X-LoRA model that offers scientific capabilities including forward/inverse analysis tasks and enhanced reasoning capability, focused on biomaterial analysis, protein mechanics and design.
arXiv Detail & Related papers (2024-02-11T10:23:34Z)
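The layer-wise, token-level LoRA mixture described in the X-LoRA entry can be illustrated as a gated combination of several low-rank updates applied to a frozen linear layer. This is a generic stand-in sketch, not the released X-LoRA implementation.

```python
# Per-token mixture of LoRA adapters: a gating network scales several
# low-rank updates token-by-token on top of a frozen base projection.
import torch
import torch.nn as nn

class LoRAMixtureLinear(nn.Module):
    def __init__(self, d_in=128, d_out=128, rank=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.gate = nn.Linear(d_in, n_experts)   # token-level expert scores

    def forward(self, x):                         # x: (batch, seq, d_in)
        w = torch.softmax(self.gate(x), dim=-1)   # (batch, seq, n_experts)
        # Low-rank update per expert, B_k A_k x, mixed by the gate weights
        low = torch.einsum('erd,bsd->bser', self.A, x)     # (b, s, e, rank)
        upd = torch.einsum('eor,bser->bseo', self.B, low)  # (b, s, e, d_out)
        return self.base(x) + (w.unsqueeze(-1) * upd).sum(dim=2)

layer = LoRAMixtureLinear()
y = layer(torch.randn(2, 16, 128))                # -> (2, 16, 128)
```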
- Interface-Driven Peptide Folding: Quantum Computations on Simulated Membrane Surfaces [0.0]
This study extends an existing quantum computing algorithm to address the complexities of antimicrobial peptide interactions at interfaces.
Our approach does not demand a higher number of qubits compared to simulations in homogeneous media, making it more feasible with current quantum computing resources.
arXiv Detail & Related papers (2024-01-10T11:18:19Z)
- Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction [3.2274401541163322]
We propose a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens.
Our proposed model, SURVPATH, achieves state-of-the-art performance when evaluated against both unimodal and multimodal baselines.
arXiv Detail & Related papers (2023-04-13T21:02:32Z)
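To make the pathway-histology interaction in the SURVPATH entry concrete, here is a generic cross-attention sketch; the token counts, pooling, and survival head are assumptions, and the real model's attention pattern and memory optimisations are not reproduced.

```python
# Cross-modal attention: a few pathway tokens attend over many histology
# patch tokens, so the attention matrix stays (50 x 4000) rather than a
# full (4050 x 4050) joint self-attention -- one route to memory efficiency.
import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
pathway = torch.randn(1, 50, d)      # 50 pathway tokens (assumed count)
patches = torch.randn(1, 4000, d)    # 4000 histology patch tokens
fused, _ = attn(query=pathway, key=patches, value=patches)
risk = nn.Linear(d, 1)(fused.mean(dim=1))   # pooled tokens -> risk score
```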