Related papers: A general language model for peptide identification

A general language model for peptide identification

URL: http://arxiv.org/abs/2502.15610v2
Date: Thu, 17 Apr 2025 17:52:57 GMT
Title: A general language model for peptide identification
Authors: Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang,
Abstract summary: PDeepPP is a deep learning framework that integrates pretrained protein language models with parallel transformer-CNN architectures.<n>The model's hybrid architecture demonstrates unique capabilities in capturing both local sequence motifs and global structural features.<n>It achieved 218* acceleration over sequence-alignment-based methods while maintaining 99.5% specificity in critical glycosylation site detection.
Score: 4.044600688588866
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in peptide identification are revolutionizing our ability to decipher protein functions and accelerate therapeutic discovery. We present PDeepPP, a deep learning framework that integrates pretrained protein language models with parallel transformer-CNN architectures, achieving state-of-the-art performance in peptide characterization tasks. The model's hybrid architecture demonstrates unique capabilities in capturing both local sequence motifs and global structural features, as evidenced by 29% improved cluster separation in UMAP visualizations compared to conventional approaches. Evaluated across 33 biological recognition tasks - including post-translational modification site prediction and bioactive peptide identification - PDeepPP outperformed existing methods in 25 tasks with average AUC improvements of 4.2%. Notably, it achieved 0.9726 accuracy with PR AUC 0.9977 in antimicrobial peptide detection while reducing false negatives by 37.5% in antimalarial recognition scenarios. This framework enables accurate large-scale peptide analysis, achieving 218* acceleration over sequence-alignment-based methods while maintaining 99.5% specificity in critical glycosylation site detection.PDeepPP establishes a new paradigm for computational peptide analysis through its synergistic architecture design, enabling rapid yet precise functional annotation that bridges molecular pattern recognition with translational biomedical applications.We have made our implementation, including code, data, and pretrained models, publicly available via GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP).

Related papers

A Multimodal Human Protein Embeddings Database: DeepDrug Protein Embeddings Bank (DPEB) [0.3822990432531661]
DPEB is a curated collection of 22,043 human proteins that integrates four embedding types.<n> DPEB supports multiple graph neural network methods for PPI prediction.
arXiv Detail & Related papers (2025-10-24T20:22:17Z)
ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT Embeddings [9.626183317998143]
We propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network.<n>ProteinBERT embeddings substantially outperform other representations on large datasets.<n>Our model consistently outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2025-07-27T21:54:32Z)
PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [80.08310253195144]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective.<n> PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
arXiv Detail & Related papers (2025-07-07T15:21:05Z)
ProtCLIP: Function-Informed Protein Multi-Modal Learning [18.61302416993122]
We develop ProtCLIP, a multi-modality foundation model that represents function-aware protein embeddings.<n>Our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks.<n>The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.
arXiv Detail & Related papers (2024-12-28T04:23:47Z)
Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting [0.0]
HOPER (HOlistic ProtEin Representation) is a novel framework designed to enhance protein function prediction (PFP) in low-data settings.<n>Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research.
arXiv Detail & Related papers (2024-11-22T20:13:55Z)
MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction [65.33218256339151]
Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. We introduce the MeToken model, which tokenizes the micro-environment of each acid, integrating both sequence and structural information into unified discrete tokens.
arXiv Detail & Related papers (2024-11-04T07:14:28Z)
Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision [7.275932354889042]
We introduce a protein language model tailored to generate protein sequences with distinct properties. We rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation.
arXiv Detail & Related papers (2024-10-25T00:15:39Z)
Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties [5.812284760539713]
Multi-Peptide is an innovative approach that combines transformer-based language models with Graph Neural Networks (GNNs) to predict peptide properties. Evaluations on hemolysis and nonfouling datasets demonstrate Multi-Peptide's robustness, achieving state-of-the-art 86.185% accuracy in hemolysis prediction. This study highlights the potential of multimodal learning in bioinformatics, paving the way for accurate and reliable predictions in peptide-based research and applications.
arXiv Detail & Related papers (2024-07-02T20:13:47Z)
ContactNet: Geometric-Based Deep Learning Model for Predicting Protein-Protein Interactions [2.874893537471256]
We develop a novel attention-based Graph Neural Network (GNN), ContactNet, for classifying PPI models into accurate and incorrect ones. When trained on docked antigen and modeled antibody structures, ContactNet doubles the accuracy of current state-of-the-art scoring functions.
arXiv Detail & Related papers (2024-06-26T12:54:41Z)
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics [58.03989832372747]
We present the first unified benchmark NovoBench for emphde novo peptide sequencing. It comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics. Recent methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo and $pi$-HelixNovo are integrated into our framework.
arXiv Detail & Related papers (2024-06-16T08:23:21Z)
PPFlow: Target-aware Peptide Design with Torsional Flow Matching [52.567714059931646]
We propose a target-aware peptide design method called textscPPFlow to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design.
arXiv Detail & Related papers (2024-03-05T13:26:42Z)
MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding [82.31506767274841]
Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. MPAE-PPI encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary" MPAE-PPI can scale to PPI prediction with millions of PPIs with superior trade-offs between effectiveness and computational efficiency.
arXiv Detail & Related papers (2024-02-22T09:04:41Z)
Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry. We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z)
PTransIPs: Identification of phosphorylation sites enhanced by protein PLM embeddings [2.971764950146918]
We develop PTransIPs, a new deep learning framework for the identification of phosphorylation sites. PTransIPs outperforms existing state-of-the-art (SOTA) methods, achieving AUCs of 0.9232 and 0.9660.
arXiv Detail & Related papers (2023-08-08T07:50:38Z)
Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models. It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z)
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures. Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
Unsupervisedly Prompting AlphaFold2 for Few-Shot Learning of Accurate Folding Landscape and Protein Structure Prediction [28.630603355510324]
We present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets. By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in low-data regime.
arXiv Detail & Related papers (2022-08-20T10:23:17Z)
From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning [40.83037811977803]
Dynaformer is a graph-based deep learning model developed to predict protein-ligand binding affinities. It exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset. In a virtual screening on heat shock protein 90 (HSP90), 20 candidates are identified and their binding affinities are experimentally validated.
arXiv Detail & Related papers (2022-08-19T14:55:12Z)
Improved Drug-target Interaction Prediction with Intermolecular Graph Transformer [98.8319016075089]
We propose a novel approach to model intermolecular information with a three-way Transformer-based architecture. Intermolecular Graph Transformer (IGT) outperforms state-of-the-art approaches by 9.1% and 20.5% over the second best for binding activity and binding pose prediction respectively. IGT exhibits promising drug screening ability against SARS-CoV-2 by identifying 83.1% active drugs that have been validated by wet-lab experiments with near-native predicted binding poses.
arXiv Detail & Related papers (2021-10-14T13:28:02Z)
EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network. Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts [2.208694022993555]
Protein-protein interactions (PPI) are critical to the function of proteins in both normal and diseased cells. Only a small percentage of PPIs captured in protein interaction databases have annotations of function available. Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts.
arXiv Detail & Related papers (2020-08-20T01:42:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.