Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction
- URL: http://arxiv.org/abs/2510.25814v1
- Date: Wed, 29 Oct 2025 14:40:13 GMT
- Title: Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction
- Authors: Yilong Lu, Si Chen, Songyan Gao, Han Liu, Xin Dong, Wenfeng Shen, Guangtai Ding,
- Abstract summary: Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium.<n>This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences.
- Score: 10.510705826988952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional non-biological storage media, such as hard drives, face limitations in both storage density and lifespan due to the rapid growth of data in the big data era. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium due to their high storage density, structural stability, and long lifespan. The sequencing of mirror-image peptides relies on \textit{de-novo} technology. However, its accuracy is limited by the scarcity of tandem mass spectrometry datasets and the challenges that current algorithms encounter when processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. In this work, we introduce DBond, a deep neural network based model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors for the prediction of mirror-image peptide bond cleavage. In this process, sequences with a high peptide bond cleavage ratio, which are easy to sequence, are selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled data based on MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.
Related papers
- Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics [11.487378569927499]
Pep2Prob is the first comprehensive dataset and benchmark designed for peptide-specific ion probability prediction.<n>Dataset contains fragment ion probability statistics for 608,780 precursors is a pair of peptide sequence and charge state.
arXiv Detail & Related papers (2025-08-12T20:39:50Z) - Diffusion Decoding for Peptide De Novo Sequencing [71.91773485443125]
This paper investigates using diffusion decoders adapted for the discrete data domain.<n>These decoders provide a different approach, allowing sequence generation to start from any peptide segment.<n>Although peptide precision and recall were still 0, the best diffusion decoder design with the DINOISER loss function obtained a statistically significant improvement in amino acid recall by 0.373.
arXiv Detail & Related papers (2025-07-15T03:38:01Z) - NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics [58.03989832372747]
We present the first unified benchmark NovoBench for emphde novo peptide sequencing.
It comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics.
Recent methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo and $pi$-HelixNovo are integrated into our framework.
arXiv Detail & Related papers (2024-06-16T08:23:21Z) - Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry [1.338778493151964]
We introduce DiaTrans, a deep-learning model based on transformer architecture.
It deciphers peptide sequences from DIA mass spectrometry data.
Our results show significant improvements over existing STOA methods.
arXiv Detail & Related papers (2024-02-17T19:04:23Z) - PepHarmony: A Multi-View Contrastive Learning Framework for Integrated
Sequence and Structure-Based Peptide Encoding [21.126660909515607]
This study introduces a novel multi-view contrastive learning framework PepHarmony for the sequence-based peptide encoding task.
We carefully select datasets from the Protein Data Bank (PDB) and AlphaFold database to encompass a broad spectrum of peptide sequences and structures.
The experimental data highlights PepHarmony's exceptional capability in capturing the intricate relationship between peptide sequences and structures.
arXiv Detail & Related papers (2024-01-21T01:16:53Z) - ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide
Sequencing [70.12220342151113]
ContraNovo is a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides.
ContraNovo consistently outshines contemporary state-of-the-art solutions.
arXiv Detail & Related papers (2023-12-18T12:49:46Z) - Efficient Prediction of Peptide Self-assembly through Sequential and
Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z) - Designing Biological Sequences via Meta-Reinforcement Learning and
Bayesian Optimization [68.28697120944116]
We train an autoregressive generative model via Meta-Reinforcement Learning to propose promising sequences for selection.
We pose this problem as that of finding an optimal policy over a distribution of MDPs induced by sampling subsets of the data.
Our in-silico experiments show that meta-learning over such ensembles provides robustness against reward misspecification and achieves competitive results.
arXiv Detail & Related papers (2022-09-13T18:37:27Z) - EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based
Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
arXiv Detail & Related papers (2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.