Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
- URL: http://arxiv.org/abs/2408.16245v2
- Date: Fri, 27 Sep 2024 06:09:41 GMT
- Title: Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
- Authors: Sully F. Chen, Robert J. Steele, Beakal Lemeneh, Shivanand P. Lad, Eric Oermann
- Abstract summary: We present our work training the first multi-omic nucleotide-peptide foundation models.
We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions.
We also demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks.
- Score: 2.84640003522012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in peptide sequence and structural modeling. However, these single-omic models are naturally incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions. We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given oligonucleotide and peptide, as well as the effect on this binding interaction of mutations in the oligonucleotide sequence (ΔΔG). Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction. Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omic distributions, suggesting a more generalized or foundational approach to building these models.
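To make the fine-tuning setup concrete, here is a minimal PyTorch sketch: a pretrained multi-omic encoder (stubbed out below with a toy transformer) plus a regression head over a pooled peptide+oligonucleotide token sequence. The tokenization scheme, pooling choice, and all module names are our assumptions, not the authors' implementation. Since ΔΔG = ΔG(mutant) − ΔG(wild type), the mutation-effect task reduces to two forward passes.

```python
# Minimal fine-tuning sketch (not the authors' released code): a pretrained
# multi-omic encoder with a small regression head predicting binding ΔG.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained multi-omic transformer (assumption)."""
    def __init__(self, vocab_size=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):                      # ids: (batch, seq_len)
        return self.encoder(self.embed(ids))     # (batch, seq_len, hidden_dim)

class DeltaGRegressor(nn.Module):
    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1))

    def forward(self, ids):
        # ids encodes one peptide and one oligonucleotide as a single
        # separator-joined token sequence (tokenization scheme assumed).
        pooled = self.encoder(ids).mean(dim=1)   # mean pooling; CLS pooling also plausible
        return self.head(pooled).squeeze(-1)     # predicted ΔG

model = DeltaGRegressor(ToyEncoder(), hidden_dim=64)
pairs = torch.randint(0, 32, (2, 48))            # two toy peptide+oligo pairs
print(model(pairs).shape)                        # torch.Size([2])
# ΔΔG for an oligo mutation is the difference of two such predictions:
#   ddg = model(mutant_pair) - model(wildtype_pair)
```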
Related papers
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties [5.812284760539713]
Multi-Peptide is an innovative approach that combines transformer-based language models with Graph Neural Networks (GNNs) to predict peptide properties.
Evaluations on hemolysis and nonfouling datasets demonstrate Multi-Peptide's robustness, achieving state-of-the-art 86.185% accuracy in hemolysis prediction.
This study highlights the potential of multimodal learning in bioinformatics, paving the way for accurate and reliable predictions in peptide-based research and applications.
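As a rough illustration of this kind of language-graph fusion, here is a generic late-fusion sketch; the toy message-passing layer, module names, and dimensions are our assumptions, not the Multi-Peptide architecture.

```python
# Illustrative late-fusion sketch (assumed, not the Multi-Peptide code):
# combine a sequence-model embedding with a graph embedding for classification.
import torch
import torch.nn as nn

class ToyGraphLayer(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):                 # x: (nodes, dim), adj: (nodes, nodes)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))

class FusionClassifier(nn.Module):
    """Late fusion: concatenate a sequence embedding with a pooled graph embedding."""
    def __init__(self, seq_dim, node_dim, n_classes=2):
        super().__init__()
        self.gnn = ToyGraphLayer(node_dim)
        self.head = nn.Linear(seq_dim + node_dim, n_classes)

    def forward(self, seq_emb, node_feats, adj):
        g = self.gnn(node_feats, adj).mean(dim=0)   # pool node states to one vector
        return self.head(torch.cat([seq_emb, g], dim=-1))

seq_emb = torch.randn(128)                     # e.g. pooled output of a peptide LM
nodes, adj = torch.randn(10, 32), torch.eye(10)
logits = FusionClassifier(128, 32)(seq_emb, nodes, adj)  # per-class scores
```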
arXiv Detail & Related papers (2024-07-02T20:13:47Z)
- Multi-modal Transfer Learning between Biological Foundation Models [2.6545450959042234]
We propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality encoders.
We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods.
We open-source our model, paving the way for new multi-modal gene expression approaches.
arXiv Detail & Related papers (2024-06-20T09:44:53Z)
- Full-Atom Peptide Design based on Multi-modal Flow Matching [32.58558711545861]
We present PepFlow, the first multi-modal deep generative model grounded in the flow-matching framework for the design of full-atom peptides.
We characterize the peptide structure using rigid backbone frames within the SE(3) manifold and side-chain angles on high-dimensional tori.
Our approach adeptly tackles various tasks such as fix-backbone sequence design and side-chain packing through partial sampling.
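For context on the representation above: rigid backbone frames are commonly built from the N, Cα, and C atom positions via Gram-Schmidt, and angles on a torus are often embedded as (cos, sin) pairs. A minimal NumPy sketch of both constructions (our illustration, not PepFlow's code):

```python
# Sketch: a rigid backbone frame (rotation + origin) from N, CA, C coordinates,
# and a (cos, sin) embedding for torsion angles on a torus.
import numpy as np

def backbone_frame(n, ca, c):
    """Orthonormal frame (R, t) from N, CA, C coordinates via Gram-Schmidt."""
    e1 = (c - ca) / np.linalg.norm(c - ca)
    u2 = (n - ca) - np.dot(n - ca, e1) * e1    # remove component along e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)                      # completes a right-handed basis
    return np.stack([e1, e2, e3], axis=-1), ca # rotation (columns = axes), origin

def torus_embed(angles):
    """Embed torsion angles on a torus as (cos, sin) pairs."""
    return np.stack([np.cos(angles), np.sin(angles)], axis=-1)

R, t = backbone_frame(np.array([1.0, 0, 0]), np.zeros(3), np.array([0, 1.0, 0]))
chi = torus_embed(np.array([0.5, -2.1, 3.0]))  # three side-chain angles -> (3, 2)
```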
arXiv Detail & Related papers (2024-06-02T12:59:54Z)
- Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion [4.292173366949847]
We introduce MMDiff, a generative model that jointly designs the sequences and structures of nucleic acids and proteins, either independently or in complex.
Such a model has important implications for emerging areas of macromolecular design including structure-based transcription factor design and design of noncoding RNA sequences.
arXiv Detail & Related papers (2023-12-21T05:53:33Z)
- Atom-Motif Contrastive Transformer for Molecular Property Prediction [68.85399466928976]
Graph Transformer (GT) models have been widely used in the task of Molecular Property Prediction (MPP).
We propose a novel Atom-Motif Contrastive Transformer (AMCT) that captures both atom-level and motif-level interactions.
Our proposed AMCT is extensively evaluated on seven popular benchmark datasets, and both quantitative and qualitative results firmly demonstrate its effectiveness.
arXiv Detail & Related papers (2023-10-11T10:03:10Z)
- Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions, such as isoelectric point and hydration free energy.
arXiv Detail & Related papers (2023-07-17T00:43:33Z)
- Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model [44.60174246341653]
We present a novel multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties.
Our proposed pipeline of data handling and training objectives aligns structure and property features in a common embedding space.
These contributions yield synergistic knowledge, allowing a single model to tackle both multimodal and unimodal downstream tasks.
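One standard way to align two modalities in a common embedding space is a symmetric InfoNCE (CLIP-style) objective over paired embeddings; the sketch below is a generic recipe under that assumption, not the paper's exact training objective.

```python
# Generic CLIP-style alignment sketch (not the paper's exact objective):
# pull paired structure/property embeddings together, push mismatches apart.
import torch
import torch.nn.functional as F

def symmetric_infonce(struct_emb, prop_emb, temperature=0.07):
    s = F.normalize(struct_emb, dim=-1)        # (batch, dim)
    p = F.normalize(prop_emb, dim=-1)
    logits = s @ p.t() / temperature           # pairwise cosine similarities
    targets = torch.arange(len(s))             # i-th structure pairs with i-th property
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = symmetric_infonce(torch.randn(8, 64), torch.randn(8, 64))
```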
arXiv Detail & Related papers (2022-11-19T05:16:08Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
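The Potts model referenced here assigns each sequence an energy from per-position fields and pairwise couplings, and new sequences can be drawn by Metropolis sampling. A minimal sketch with random stand-in parameters (in practice the fields and couplings are fit to data):

```python
# Minimal Potts-model sketch: pairwise sequence energy plus Metropolis sampling.
# Fields h and couplings J are random stand-ins here, not fitted parameters.
import numpy as np

rng = np.random.default_rng(0)
L, Q = 20, 4                                   # sequence length, alphabet size (ACGU)
h = rng.normal(size=(L, Q))                    # per-position fields
J = rng.normal(scale=0.1, size=(L, L, Q, Q))   # pairwise couplings
J = (J + J.transpose(1, 0, 3, 2)) / 2          # enforce symmetry J_ij(a,b) = J_ji(b,a)

def energy(seq):
    e = -h[np.arange(L), seq].sum()
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

def metropolis(seq, steps=1000, beta=1.0):
    seq, e = seq.copy(), energy(seq)
    for _ in range(steps):
        i, a = rng.integers(L), rng.integers(Q)    # propose a single-site mutation
        new = seq.copy(); new[i] = a
        e_new = energy(new)
        if e_new <= e or rng.random() < np.exp(-beta * (e_new - e)):
            seq, e = new, e_new                    # Metropolis acceptance rule
    return seq

sample = metropolis(rng.integers(Q, size=L))       # one sampled sequence
```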
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
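In spirit, this kind of self-supervised pre-training is masked-token prediction over genome sequences; the following is a generic sketch (our illustration, not GeneBERT's implementation).

```python
# Generic masked-token pretraining sketch for genome sequences
# (illustrative; not GeneBERT's actual code).
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mask_tokens(ids, mask_id=4, p=0.15):
    mask = torch.rand(ids.shape) < p
    corrupted = ids.clone()
    corrupted[mask] = mask_id                  # replace ~15% of tokens with [MASK]
    labels = ids.clone()
    labels[~mask] = -100                       # ignored by cross-entropy
    return corrupted, labels

embed = nn.Embedding(len(VOCAB), 32)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(32, len(VOCAB))

ids = torch.randint(0, 4, (8, 64))             # a batch of toy DNA windows
corrupted, labels = mask_tokens(ids)
logits = lm_head(encoder(embed(corrupted)))    # predict the masked-out bases
loss = nn.functional.cross_entropy(logits.view(-1, len(VOCAB)), labels.view(-1))
```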
arXiv Detail & Related papers (2021-10-11T12:48:44Z)