Deep Learning Model for Amyloidogenicity Prediction using a Pre-trained Protein LLM
- URL: http://arxiv.org/abs/2508.12575v1
- Date: Mon, 18 Aug 2025 02:21:48 GMT
- Title: Deep Learning Model for Amyloidogenicity Prediction using a Pre-trained Protein LLM
- Authors: Zohra Yagoub, Hafida Bouziane
- Abstract summary: Recent approaches to predicting amyloidogenicity within proteins rely heavily on evolutionary motifs and the individual properties of amino acids. Our study evaluated the contextual features of protein sequences obtained from a pretrained protein large language model. Our method achieved an accuracy of 84.5% on 10-fold cross-validation and an accuracy of 83% on the test dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prediction of amyloidogenicity in peptides and proteins remains a focal point of ongoing research in bioinformatics, and the crucial step in this field is to apply advanced computational methodologies. Many recent approaches to predicting amyloidogenicity within proteins rely heavily on evolutionary motifs and the individual properties of amino acids, yet it is becoming increasingly evident that features based on sequence information offer high predictive performance. Consequently, our study evaluated the contextual features of protein sequences obtained from a pretrained protein large language model, leveraging bidirectional LSTM and GRU networks to predict amyloidogenic regions in peptide and protein sequences. Our method achieved an accuracy of 84.5% on 10-fold cross-validation and an accuracy of 83% on the test dataset. Our results demonstrate competitive performance, highlighting the potential of LLMs in enhancing the accuracy of amyloid prediction.
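Method sketch: the abstract describes per-residue embeddings from a pretrained protein LLM fed to a bidirectional LSTM and a GRU to classify amyloidogenic regions. Below is a minimal illustration of that kind of architecture, not the authors' published configuration: the layer sizes, the stacking order, and the ESM2 embedding dimensionality are assumptions made for the example.

```python
# Hedged sketch of an LLM-embedding -> BiLSTM -> GRU region classifier.
# Assumptions (not from the paper): 320-d embeddings (matches the small
# esm2_t6_8M_UR50D checkpoint), 128 hidden units, BiLSTM stacked before GRU.
import torch
import torch.nn as nn

class AmyloidRegionClassifier(nn.Module):
    """Per-residue amyloidogenicity classifier over protein-LLM embeddings."""

    def __init__(self, embed_dim: int = 320, hidden: int = 128):
        super().__init__()
        # Bidirectional LSTM reads the residue embeddings in both directions.
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)
        # GRU refines the BiLSTM outputs (2 * hidden, one per direction).
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        # One logit per residue: amyloidogenic vs. non-amyloidogenic.
        self.head = nn.Linear(hidden, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim), e.g. frozen ESM2 features.
        x, _ = self.bilstm(embeddings)
        x, _ = self.gru(x)
        return self.head(x).squeeze(-1)  # (batch, seq_len) logits

# Usage with random stand-ins for real per-residue LLM embeddings:
clf = AmyloidRegionClassifier()
logits = clf(torch.randn(1, 40, 320))  # one 40-residue peptide
probs = torch.sigmoid(logits)          # per-residue amyloidogenicity probability
```

In practice the embeddings would come from a frozen protein LLM (e.g. an ESM2 checkpoint), and the head would be trained with a binary cross-entropy loss against labeled amyloidogenic regions.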
Related papers
- Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design [61.2846583160056]
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, but it remains challenging for proteins, in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
arXiv Detail & Related papers (2025-12-10T05:34:47Z) - PLAME: Leveraging Pretrained Language Models to Generate Enhanced Protein Multiple Sequence Alignments [53.55710514466851]
Protein structure prediction is essential for drug discovery and understanding biological functions. Most folding models rely heavily on multiple sequence alignments (MSAs) to boost prediction performance. We propose PLAME, a novel MSA design model that leverages evolutionary embeddings from pretrained protein language models.
arXiv Detail & Related papers (2025-06-17T04:11:30Z) - Protein Large Language Models: A Comprehensive Survey [71.65899614084853]
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. This work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications.
arXiv Detail & Related papers (2025-02-21T19:22:10Z) - Computational Protein Science in the Era of Large Language Models (LLMs) [54.35488233989787]
Computational protein science is dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. Recently, protein language models (pLMs) have emerged as a milestone in AI due to their unprecedented language processing and generalization capability.
arXiv Detail & Related papers (2025-01-17T16:21:18Z) - SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - EMOCPD: Efficient Attention-based Models for Computational Protein Design Using Amino Acid Microenvironment [8.661662081290265]
We develop Efficient attention-based Models for Computational Protein Design using amino acid microenvironment (EMOCPD).
EMOCPD predicts the category of each amino acid in a protein by analyzing the three-dimensional atomic environment surrounding it, and optimizes the protein based on the predicted high-probability amino acid categories.
It achieves over 80% accuracy on the training set and 68.33% and 62.32% accuracy on two independent test sets, respectively.
arXiv Detail & Related papers (2024-10-28T14:31:18Z) - pLDDT-Predictor: High-speed Protein Screening Using Transformer and ESM2 [3.9703338485541244]
We introduce pLDDT-Predictor, a high-speed protein screening tool that achieves a 250,000× speedup compared to AlphaFold2. Our model predicts AlphaFold2's pLDDT scores with a Pearson correlation of 0.7891 and processes proteins in just 0.007 seconds on average. Using a comprehensive dataset of 1.5 million diverse protein sequences, we demonstrate that pLDDT-Predictor accurately classifies high-confidence structures.
arXiv Detail & Related papers (2024-10-11T03:19:44Z) - Protein-Mamba: Biological Mamba Models for Protein Function Prediction [18.642511763423048]
Protein-Mamba is a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction.
Our experiments demonstrate that Protein-Mamba achieves competitive performance compared with several state-of-the-art methods.
arXiv Detail & Related papers (2024-09-22T22:51:56Z) - Peptide Sequencing Via Protein Language Models [0.0]
We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids.
Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify.
We achieve per-amino-acid accuracy up to 90.5% when only four amino acids are known.
arXiv Detail & Related papers (2024-08-01T20:12:49Z) - Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering [24.415612744612773]
Proteins are essential to life's processes, underpinning evolution and diversity.
Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development.
Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy.
Yet, it falls short in delivering functional protein insights, signaling an opportunity for enhancing representation quality.
This study addresses this gap by incorporating protein family classification into ESM2's training, while a contextual prediction task fine-tunes local representations.
arXiv Detail & Related papers (2024-04-24T11:09:43Z) - Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict thermostability changes in proteins upon single-point mutations.
arXiv Detail & Related papers (2023-12-07T03:25:49Z) - Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.