ESM-NBR: fast and accurate nucleic acid-binding residue prediction via
protein language model feature representation and multi-task learning
- URL: http://arxiv.org/abs/2312.00842v1
- Date: Fri, 1 Dec 2023 04:00:20 GMT
- Authors: Wenwu Zeng, Dafeng Lv, Wenjuan Liu, Shaoliang Peng
- Abstract summary: We propose a fast and accurate sequence-based method, called ESM-NBR, to predict nucleic acid-binding residues.
Experimental results on benchmark data sets demonstrate that the prediction performance of ESM2 feature representation comprehensively outperforms evolutionary information-based hidden Markov model (HMM) features.
By completely discarding the time-consuming multiple sequence alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Protein-nucleic acid interactions play a very important role in a variety of
biological activities. Accurate identification of nucleic acid-binding residues
is a critical step in understanding the interaction mechanisms. Although many
computationally based methods have been developed to predict nucleic
acid-binding residues, challenges remain. In this study, a fast and accurate
sequence-based method, called ESM-NBR, is proposed. In ESM-NBR, we first use
the large protein language model ESM2 to extract a discriminative feature
representation of biological properties from protein primary sequences; then, a
multi-task deep learning model composed of stacked bidirectional long
short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks is
employed to explore common and private information of DNA- and RNA-binding
residues with ESM2 features as input. Experimental results on benchmark data
sets demonstrate that the prediction performance of ESM2 feature representation
comprehensively outperforms evolutionary information-based hidden Markov model
(HMM) features. Meanwhile, ESM-NBR obtains MCC values of 0.427 and 0.391 for
DNA-binding residue prediction on two independent test sets, which are 18.61%
and 10.45% higher than those of the second-best methods, respectively.
Moreover, by completely discarding the time-consuming multiple sequence
alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods
(5.52s for a protein sequence of length 500, which is about 16 times faster
than the second-fastest method). A user-friendly standalone package and the
data of ESM-NBR are freely available for academic use at:
https://github.com/wwzll123/ESM-NBR.
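The abstract reports results as MCC (Matthews correlation coefficient) values. As a minimal sketch of how such a score can be computed for per-residue binary labels (1 = binding, 0 = non-binding), the block below uses the standard MCC definition. The function names and the toy labels are hypothetical, and the abstract does not say how the authors thresholded predictions or aggregated residues, so this illustrates the metric only, not their evaluation pipeline.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary per-residue labels (1 = binding)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, standard definition.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here; the source does not specify one).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical toy labels for a 10-residue sequence (not data from the paper).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
    print(round(mcc(y_true, y_pred), 4))  # 0.5238 (= 11/21)
```

MCC is often preferred for binding-residue prediction because binding residues are typically a small minority class, and the score stays near zero for a predictor that labels every residue as non-binding.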
Related papers
- DiffNMR2: NMR Guided Sampling Acquisition Through Diffusion Model Uncertainty [2.4634393035848494]
We propose a novel sub-sampling strategy based on a diffusion model trained on protein NMR data.
Our method iteratively reconstructs under-sampled spectra while using model uncertainty to guide subsequent sampling, significantly reducing acquisition time.
This advancement holds promise for many applications, from drug discovery to materials science, where rapid and high-resolution spectral analysis is critical.
arXiv Detail & Related papers (2025-02-06T20:10:28Z)
- Diffusion Model with Representation Alignment for Protein Inverse Folding [53.139837825588614]
Protein inverse folding is a fundamental problem in bioinformatics, aiming to recover the amino acid sequences from a given protein backbone structure.
We propose a novel method that leverages diffusion models with representation alignment (DMRA).
In experiments, we conduct extensive evaluations on the CATH4.2 dataset to demonstrate that DMRA outperforms leading methods.
arXiv Detail & Related papers (2024-12-12T15:47:59Z)
- SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions [8.112057136324431]
This study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model.
A multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information.
arXiv Detail & Related papers (2024-11-18T12:40:39Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- Accelerating Drug Safety Assessment using Bidirectional-LSTM for SMILES Data [0.0]
Bi-Directional Long Short Term Memory (BiLSTM) is a variant of Recurrent Neural Network (RNN) that processes input molecular sequences.
The proposed work aims to understand the sequential patterns encoded in the SMILES strings, which are then utilised for predicting the toxicity of the molecules.
arXiv Detail & Related papers (2024-07-08T18:12:11Z)
- NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics [58.03989832372747]
We present the first unified benchmark NovoBench for de novo peptide sequencing.
It comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics.
Recent methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo, and π-HelixNovo, are integrated into our framework.
arXiv Detail & Related papers (2024-06-16T08:23:21Z)
- DisorderUnetLM: Validating ProteinUnet for efficient protein intrinsic disorder prediction [0.0]
The prediction of intrinsic disorder regions has significant implications for understanding protein functions and dynamics.
Recently, a new generation of predictors based on protein language models (pLMs) is emerging.
The article introduces the new DisorderUnetLM disorder predictor, which builds upon the idea of ProteinUnet.
arXiv Detail & Related papers (2024-04-11T20:14:14Z)
- A Multi-Grained Symmetric Differential Equation Model for Learning Protein-Ligand Binding Dynamics [73.35846234413611]
In drug discovery, molecular dynamics (MD) simulation provides a powerful tool for predicting binding affinities, estimating transport properties, and exploring pocket sites.
We propose NeuralMD, the first machine learning (ML) surrogate that can facilitate numerical MD and provide accurate simulations in protein-ligand binding dynamics.
We demonstrate the efficiency and effectiveness of NeuralMD, achieving over 1K× speedup compared to standard numerical MD simulations.
arXiv Detail & Related papers (2024-01-26T09:35:17Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
- MATE-Pred: Multimodal Attention-based TCR-Epitope interaction Predictor [1.933856957193398]
An accurate binding prediction between T-cell receptors and epitopes contributes decisively to successful immunotherapy strategies.
Here, we propose a highly reliable novel method, MATE-Pred, that performs attention-based prediction of T-cell receptor and epitope binding affinity.
The performance of MATE-Pred projects its potential application in drug discovery.
arXiv Detail & Related papers (2023-12-05T11:30:00Z)
- Accurate Machine Learned Quantum-Mechanical Force Fields for Biomolecular Simulations [51.68332623405432]
Molecular dynamics (MD) simulations allow atomistic insights into chemical and biological processes.
Recently, machine learned force fields (MLFFs) emerged as an alternative means to execute MD simulations.
This work proposes a general approach to constructing accurate MLFFs for large-scale molecular simulations.
arXiv Detail & Related papers (2022-05-17T13:08:28Z)
- Decoding the Protein-ligand Interactions Using Parallel Graph Neural Networks [6.460973806588082]
We present a novel parallel graph neural network (GNN) to integrate knowledge representation and reasoning for PLI prediction.
Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicted activity, potency, and biophysical properties of lead candidates.
arXiv Detail & Related papers (2021-11-30T06:02:04Z)
- Confidence-guided Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images in Patients with Post-treatment Malignant Gliomas [65.64363834322333]
Confidence Guided SAMR (CG-SAMR) synthesizes data from lesion information to multi-modal anatomic sequences.
A module guides the synthesis based on a confidence measure of the intermediate results.
Experiments on real clinical data demonstrate that the proposed model can perform better than the state-of-the-art synthesis methods.
arXiv Detail & Related papers (2020-08-06T20:20:22Z)