Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing
- URL: http://arxiv.org/abs/2506.13485v1
- Date: Mon, 16 Jun 2025 13:44:25 GMT
- Title: Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing
- Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun
- Abstract summary: We propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90%, based on sampled training runs over various data distributions.
- Score: 21.01399785232482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to its complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts each protein's learning difficulty based on the model's estimated generation capability, assessed through a sampling process, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self-refining inference-time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine-grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90%, based on sampled training runs over various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species.
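The two ingredients of the abstract lend themselves to a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it uses peptide length as a stand-in difficulty score (the paper instead estimates difficulty from the model's own generation capability via sampling) and PyTorch's `nn.CTCLoss` for the non-autoregressive decoding objective; the function and variable names are hypothetical.

```python
import torch.nn as nn

# Hypothetical difficulty proxy: shorter peptides are treated as easier.
# The paper estimates difficulty from the model's own generation capability
# via sampling; length is used here only to keep the sketch self-contained.
def difficulty(peptide: str) -> float:
    return float(len(peptide))

def curriculum_subset(dataset, epoch: int, total_epochs: int):
    """Expose a growing, easy-to-hard fraction of the data each epoch."""
    ranked = sorted(dataset, key=lambda ex: difficulty(ex["peptide"]))
    frac = min(1.0, (epoch + 1) / total_epochs)        # linear pacing schedule
    return ranked[: max(1, int(frac * len(ranked)))]

# CTC objective used by non-autoregressive peptide decoders: the model emits a
# per-position distribution over amino acids plus a blank symbol, and CTC
# marginalises over all alignments with the target sequence.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def nat_ctc_loss(log_probs, targets, input_lengths, target_lengths):
    # log_probs: (T, batch, n_amino_acids + 1), log-softmaxed over the last dim
    return ctc(log_probs, targets, input_lengths, target_lengths)
```

The pacing function here is a fixed linear schedule; the paper's curriculum is driven by the model's sampled generation success rather than a preset fraction.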
Related papers
- Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein Interaction [39.13469810619366]
We propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering. We apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios.
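As a rough illustration of the idea, the sketch below shuffles contiguous chunks of a protein sequence as a pretraining augmentation; the chunk count and shuffling policy are assumptions, not the paper's exact recipe.

```python
import random

def reorder_subsequences(seq: str, n_chunks: int = 4, rng=random) -> str:
    """Split a protein sequence into contiguous chunks and shuffle their order."""
    size = max(1, len(seq) // n_chunks)
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    rng.shuffle(chunks)
    return "".join(chunks)

# Example: generate a reordered view of a short sequence for pretraining.
print(reorder_subsequences("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```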
arXiv Detail & Related papers (2025-07-28T15:31:15Z) - Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios. We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z) - Universal Biological Sequence Reranking for Improved De Novo Peptide Sequencing [32.29218860420551]
RankNovo is the first deep reranking framework that enhances de novo peptide sequencing. Our work presents a novel reranking strategy that challenges existing single-model paradigms and advances the frontier of accurate de novo sequencing.
arXiv Detail & Related papers (2025-05-23T06:56:55Z) - A general language model for peptide identification [4.044600688588866]
PDeepPP is a deep learning framework that integrates pretrained protein language models with parallel transformer-CNN architectures. The model's hybrid architecture demonstrates unique capabilities in capturing both local sequence motifs and global structural features. It achieved a 218× acceleration over sequence-alignment-based methods while maintaining 99.5% specificity in critical glycosylation site detection.
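A minimal sketch of such a hybrid is shown below, assuming an embedding layer feeding a transformer branch (global context) and a CNN branch (local motifs) whose features are concatenated for per-residue prediction; the dimensions and the two-class head are placeholders, not PDeepPP's actual configuration.

```python
import torch
import torch.nn as nn

class ParallelTransformerCNN(nn.Module):
    """Run a transformer branch and a CNN branch in parallel over the same
    embeddings and concatenate their features (a simplified hybrid sketch)."""
    def __init__(self, dim: int = 128, vocab: int = 21):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.cnn = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU())
        self.head = nn.Linear(2 * dim, 2)                  # e.g. site / non-site

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                             # (B, L, dim)
        t = self.transformer(x)                            # global context
        c = self.cnn(x.transpose(1, 2)).transpose(1, 2)    # local motifs
        return self.head(torch.cat([t, c], dim=-1))        # per-residue logits
```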
arXiv Detail & Related papers (2025-02-21T17:31:22Z) - Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as a reward function to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
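The setup can be sketched as a simple policy-gradient loop in which a protein language model scores sampled sequences. The snippet below substitutes a toy hydrophobicity reward so it runs without pretrained weights, and the policy-logits/optimizer plumbing is assumed rather than taken from the paper.

```python
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Stand-in reward: a real setup would use a pretrained protein language
# model's (pseudo-)log-likelihood; this toy reward favours hydrophobic
# residues so the sketch runs without external weights.
def plm_reward(seq_idx: torch.Tensor) -> torch.Tensor:
    hydrophobic = torch.tensor([AMINO_ACIDS.index(a) for a in "AILMFVW"])
    return torch.isin(seq_idx, hydrophobic).float().mean(dim=-1)

def reinforce_step(logits: torch.Tensor, optimizer: torch.optim.Optimizer) -> torch.Tensor:
    """One REINFORCE update: sample sequences from the policy's logits
    (batch x length x 20), reward them, and raise the log-probability of
    above-average samples."""
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample()                                  # (batch, length)
    reward = plm_reward(samples)                             # (batch,)
    advantage = reward - reward.mean()
    loss = -(dist.log_prob(samples).sum(dim=-1) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean()
```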
arXiv Detail & Related papers (2024-07-03T14:31:36Z) - Boosting Adversarial Training via Fisher-Rao Norm-based Regularization [9.975998980413301]
We propose a novel regularization framework, called Logit-Oriented Adversarial Training (LOAT), which can mitigate the trade-off between robustness and accuracy.
Our experiments demonstrate that the proposed regularization strategy can boost the performance of the prevalent adversarial training algorithms.
arXiv Detail & Related papers (2024-03-26T09:22:37Z) - Deep Manifold Transformation for Protein Representation Learning [42.43017670985785]
We propose a new deep manifold transformation approach for universal protein representation learning (DMTPRL).
It employs manifold learning strategies to improve the quality and adaptability of the learned embeddings.
Our proposed DMTPRL method outperforms state-of-the-art baselines on diverse downstream tasks across popular datasets.
arXiv Detail & Related papers (2024-01-12T18:38:14Z) - Toward Understanding BERT-Like Pre-Training for DNA Foundation Models [78.48760388079523]
Existing pre-training methods for DNA sequences rely on direct adoption of BERT pre-training from NLP.
We introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary.
RandomMask achieves a staggering 68.16% in Matthews correlation coefficient for Epigenetic Mark Prediction, a groundbreaking increase of 19.85% over the baseline.
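A rough sketch of the masking schedule is given below, assuming the contiguous mask span grows linearly with training progress; the exact boundary-expansion schedule and rates used by RandomMask may differ.

```python
import random

def random_mask(tokens, step, total_steps, base_span=1, max_span=8,
                mask_token="[MASK]", mask_rate=0.15, rng=random):
    """BERT-style masking whose contiguous mask span grows with training
    progress (hyper-parameters here are placeholders, not the paper's)."""
    span = base_span + int((max_span - base_span) * step / max(1, total_steps))
    out = list(tokens)
    n_targets = max(1, int(mask_rate * len(tokens) / span))
    for _ in range(n_targets):
        start = rng.randrange(0, max(1, len(tokens) - span))
        for i in range(start, min(len(tokens), start + span)):
            out[i] = mask_token
    return out
```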
arXiv Detail & Related papers (2023-10-11T16:40:57Z) - Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
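The reprogramming idea can be sketched as learning each amino-acid embedding as a linear combination of a frozen English-token embedding matrix; R2DL's actual formulation is a token-level dictionary learning problem, so the dense coefficient matrix below is a simplification.

```python
import torch
import torch.nn as nn

class ReprogrammedEmbedding(nn.Module):
    """Map amino-acid tokens to linear combinations of a frozen English
    vocabulary embedding matrix (shape: vocab_size x hidden_dim)."""
    def __init__(self, english_embeddings: torch.Tensor, n_amino_acids: int = 20):
        super().__init__()
        self.register_buffer("dictionary", english_embeddings)    # frozen (V, d)
        self.coeffs = nn.Parameter(
            0.01 * torch.randn(n_amino_acids, english_embeddings.shape[0]))

    def forward(self, aa_tokens: torch.Tensor) -> torch.Tensor:
        protein_vocab = self.coeffs @ self.dictionary              # (20, d)
        return protein_vocab[aa_tokens]                            # (..., d)
```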
arXiv Detail & Related papers (2023-01-05T15:55:18Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
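A minimal sketch of a pairwise masked-LM head is given below: the hidden states of two masked positions are combined to predict the 20x20 joint distribution over their residue identities. The layer sizes and the concatenation scheme are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PairwiseMLMHead(nn.Module):
    """Predict the joint identity of two masked residues from their hidden states."""
    def __init__(self, hidden: int = 256, vocab: int = 20):
        super().__init__()
        self.vocab = vocab
        self.proj = nn.Linear(2 * hidden, vocab * vocab)   # 400 residue pairs

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([h_i, h_j], dim=-1)               # (batch, 2 * hidden)
        return self.proj(pair).view(-1, self.vocab, self.vocab)

# Training masks residue pairs (i, j), encodes the sequence, and applies a
# cross-entropy loss over the 400-way joint label a_i * 20 + a_j.
```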
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.