Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions
- URL: http://arxiv.org/abs/2408.16245v3
- Date: Tue, 01 Apr 2025 17:10:17 GMT
- Title: Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions
- Authors: Sully F. Chen, Robert J. Steele, Glen M. Hocky, Beakal Lemeneh, Shivanand P. Lad, Eric K. Oermann
- Abstract summary: We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions. We also demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks.
- Score: 4.36852565205713
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any \textit{a priori} structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omics distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
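To make the fine-tuning setup described in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of regressing binding $\Delta G$ from a paired protein/nucleic-acid input. The `MultiOmicEncoder`-style call signature, the pooling choice, and all dimensions are illustrative assumptions, not the authors' released code; the `delta_g_from_kd` helper is standard thermodynamics ($\Delta G^\circ = RT \ln(K_d / 1\,\mathrm{M})$) included as a units sanity check.

```python
# Minimal sketch (assumptions, not the authors' released code): fine-tune a
# pretrained multi-omic transformer to regress binding Delta-G for a
# protein/nucleic-acid pair. `encoder` stands in for the pretrained model and
# is assumed to map (token_ids, attention_mask) -> (B, L, H) hidden states.
import math
import torch
import torch.nn as nn

class DeltaGRegressor(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                  # hypothetical pretrained multi-omic model
        self.head = nn.Sequential(              # small regression head on pooled features
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, token_ids, attention_mask):
        # token_ids encodes "<protein tokens> <sep> <nucleic-acid tokens>" as one sequence
        hidden = self.encoder(token_ids, attention_mask)              # (B, L, H)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)  # masked mean pooling
        return self.head(pooled).squeeze(-1)                          # predicted dG (kcal/mol)

def delta_g_from_kd(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard relation dG = RT ln(Kd / 1 M); R in kcal/(mol K)."""
    return 1.987204e-3 * temp_k * math.log(kd_molar)

# e.g. a 1 nM binder: delta_g_from_kd(1e-9) is roughly -12.3 kcal/mol
```

Training would minimize a squared-error loss against experimental $\Delta G$ labels. The residue-attribution result in the abstract corresponds to inspecting the fine-tuned model's attention over protein positions, which needs no extra machinery beyond reading the encoder's attention maps.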
Related papers
- An All-Atom Generative Model for Designing Protein Complexes [49.09672038729524]
APM (All-Atom Protein Generative Model) is specifically designed for modeling multi-chain proteins.
By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch.
arXiv Detail & Related papers (2025-04-17T16:37:41Z)
- UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models.
We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models.
We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
- A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery [32.573496601865465]
Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein.
Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications.
arXiv Detail & Related papers (2025-03-06T12:04:56Z)
- Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [53.488387420073536]
Life-Code is a comprehensive framework that spans different biological functions.
Life-Code achieves state-of-the-art performance on various tasks across three omics.
arXiv Detail & Related papers (2025-02-11T06:53:59Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences.
This dataset can bridge the gap between large language models (LLMs) and complex tasks involving biological sequences.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- M2oE: Multimodal Collaborative Expert Peptide Model [0.0]
This paper proposes the M2oE multi-modal collaborative expert peptide model.
Building on previous work, it integrates sequence and spatial structural information, employing a mixture-of-experts model and a cross-attention mechanism to balance and improve the model's capabilities (see the sketch after this entry).
Experimental results indicate that the M2oE model performs excellently in complex task predictions.
arXiv Detail & Related papers (2024-11-20T09:52:52Z)
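To make the expert/cross-attention wording in the M2oE entry concrete, here is a minimal, hypothetical sketch of that pattern: two modality experts (sequence and structure) fused via cross-attention and a learned gate. Module names, dimensions, and pooling are illustrative assumptions, not the M2oE authors' code.

```python
# Illustrative sketch (not the M2oE authors' code): sequence and structure
# experts fused with cross-attention plus a learned per-example gate.
import torch
import torch.nn as nn

class TwoExpertFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2)  # mixture weights over the two experts

    def forward(self, seq_feats: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # Sequence tokens attend to structure tokens (cross-attention).
        attended, _ = self.cross_attn(seq_feats, struct_feats, struct_feats)
        seq_vec = attended.mean(dim=1)        # pool each expert's view
        struct_vec = struct_feats.mean(dim=1)
        w = torch.softmax(self.gate(torch.cat([seq_vec, struct_vec], dim=-1)), dim=-1)
        # Convex combination of the expert summaries, weighted per example.
        return w[:, :1] * seq_vec + w[:, 1:] * struct_vec

# Usage: fused = TwoExpertFusion()(seq_feats, struct_feats), where seq_feats is
# (B, Ls, 256) from a peptide language model and struct_feats is (B, Lt, 256)
# from a graph encoder.
```

The learned gate is what makes this expert-like: each example receives its own convex weighting of the two modality summaries.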
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- BSM: Small but Powerful Biological Sequence Model for Genes and Proteins [6.6055625629542085]
We introduce BSM, a small but powerful mixed-modal biological sequence foundation model.
It is trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web.
It significantly enhances learning efficiency and cross-modal representation, outperforming models trained solely on unimodal data.
arXiv Detail & Related papers (2024-10-15T11:12:28Z)
- Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties [5.812284760539713]
Multi-Peptide is an innovative approach that combines transformer-based language models with Graph Neural Networks (GNNs) to predict peptide properties.
Evaluations on hemolysis and nonfouling datasets demonstrate Multi-Peptide's robustness, achieving state-of-the-art 86.185% accuracy in hemolysis prediction.
This study highlights the potential of multimodal learning in bioinformatics, paving the way for accurate and reliable predictions in peptide-based research and applications.
arXiv Detail & Related papers (2024-07-02T20:13:47Z)
- Multi-modal Transfer Learning between Biological Foundation Models [2.6545450959042234]
We propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality encoders (see the sketch after this entry).
We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods.
We open-source our model, paving the way for new multi-modal gene expression approaches.
arXiv Detail & Related papers (2024-06-20T09:44:53Z)
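The IsoFormer entry above describes connecting pre-trained modality encoders; a minimal late-fusion sketch of that idea follows. The encoder call signature, the frozen-encoder choice, and the regression target are assumptions for illustration, not the IsoFormer implementation.

```python
# Late-fusion sketch (assumptions, not the IsoFormer code): pretrained encoders
# per modality, pooled, projected into a shared space, and concatenated for a
# downstream regression head (e.g. differential transcript expression).
import torch
import torch.nn as nn

class ModalityFusionRegressor(nn.Module):
    def __init__(self, encoders: dict, dims: dict, shared_dim: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)   # e.g. {"dna": ..., "rna": ..., "protein": ...}
        self.proj = nn.ModuleDict({name: nn.Linear(d, shared_dim) for name, d in dims.items()})
        self.head = nn.Sequential(
            nn.Linear(shared_dim * len(encoders), shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, 1),
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        parts = []
        for name in sorted(inputs):               # fixed order so concatenation is stable
            with torch.no_grad():                 # one simple choice: keep encoders frozen
                h = self.encoders[name](inputs[name])   # (B, L, D_name), assumed signature
            parts.append(self.proj[name](h.mean(dim=1)))  # pool, then project to shared space
        return self.head(torch.cat(parts, dim=-1)).squeeze(-1)
```

Whether to freeze or fine-tune the modality encoders is a design choice; freezing keeps the example cheap and isolates the fusion layers as the trainable part.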
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion [4.292173366949847]
We introduce MMDiff, a generative model that jointly designs sequences and structures of nucleic acid and protein complexes, independently or in complex.
Such a model has important implications for emerging areas of macromolecular design including structure-based transcription factor design and design of noncoding RNA sequences.
arXiv Detail & Related papers (2023-12-21T05:53:33Z)
- Navigating protein landscapes with a machine-learned transferable coarse-grained model [29.252004942896875]
Developing a coarse-grained (CG) model with prediction performance similar to all-atom molecular dynamics has been a long-standing challenge.
We develop a bottom-up CG force field with chemical transferability, which can be used for extrapolative molecular dynamics on new sequences.
We demonstrate that the model successfully predicts folded structures, intermediates, metastable folded and unfolded basins, and the fluctuations of intrinsically disordered proteins.
arXiv Detail & Related papers (2023-10-27T17:10:23Z)
- Atom-Motif Contrastive Transformer for Molecular Property Prediction [68.85399466928976]
Graph Transformer (GT) models have been widely used in the task of Molecular Property Prediction (MPP).
We propose a novel Atom-Motif Contrastive Transformer (AMCT) which explores atom-level interactions and considers motif-level interactions.
Our proposed AMCT is extensively evaluated on seven popular benchmark datasets, and both quantitative and qualitative results firmly demonstrate its effectiveness.
arXiv Detail & Related papers (2023-10-11T10:03:10Z)
- Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z)
- Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model [44.60174246341653]
We present a novel multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties.
Our proposed model pipeline of data handling and training objectives aligns the structure/property features in a common embedding space.
These contributions yield synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model.
arXiv Detail & Related papers (2022-11-19T05:16:08Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity (see the sketch after this entry).
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
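For the Potts-model entry above, the energy function itself is standard and compact enough to sketch: E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j) over sequence positions, with fields h and couplings J fit to data (e.g., by pseudolikelihood, which is assumed done elsewhere). The alphabet and random parameters below are toy values, not anything from the paper.

```python
# Toy sketch of the standard Potts sequence energy used in direct-coupling-style
# models: E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).
import numpy as np

ALPHABET = "ACGU"  # RNA aptamer alphabet for this toy example

def potts_energy(seq: str, h: np.ndarray, J: np.ndarray) -> float:
    """h: (L, q) field terms; J: (L, L, q, q) couplings (upper triangle used)."""
    idx = [ALPHABET.index(c) for c in seq]
    e = -sum(h[i, a] for i, a in enumerate(idx))        # single-site field terms
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            e -= J[i, j, idx[i], idx[j]]                # pairwise coupling terms
    return float(e)

# Usage with random toy parameters; lower energy = more favorable under the model.
L, q = 8, len(ALPHABET)
rng = np.random.default_rng(0)
h, J = rng.normal(size=(L, q)), rng.normal(size=(L, L, q, q))
print(potts_energy("ACGUACGU", h, J))
```

Sampling sequences at different temperatures under this energy is one way to trade off model-predicted fitness against the "controllable sequence diversity" the summary mentions.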
- Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models [3.5450828190071646]
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions.
We introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
arXiv Detail & Related papers (2022-05-26T16:10:09Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)