BSM: Small but Powerful Biological Sequence Model for Genes and Proteins
- URL: http://arxiv.org/abs/2410.11499v1
- Date: Tue, 15 Oct 2024 11:12:28 GMT
- Title: BSM: Small but Powerful Biological Sequence Model for Genes and Proteins
- Authors: Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai,
- Abstract summary: We introduce BSM, a small but powerful mixed-modal biological sequence foundation model.
It is trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web.
It significantly enhances learning efficiency and cross-modal representation, outperforming models trained solely on unimodal data.
- Score: 6.6055625629542085
- License:
- Abstract: Modeling biological sequences such as DNA, RNA, and proteins is crucial for understanding complex processes like gene regulation and protein synthesis. However, most current models either focus on a single type or treat multiple types of data separately, limiting their ability to capture cross-modal relationships. We propose that by learning the relationships between these modalities, the model can enhance its understanding of each type. To address this, we introduce BSM, a small but powerful mixed-modal biological sequence foundation model, trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web. These datasets capture the genetic flow, gene-protein relationships, and the natural co-occurrence of diverse biological data, respectively. By training on mixed-modal data, BSM significantly enhances learning efficiency and cross-modal representation, outperforming models trained solely on unimodal data. With only 110M parameters, BSM achieves performance comparable to much larger models across both single-modal and mixed-modal tasks, and uniquely demonstrates in-context learning capability for mixed-modal tasks, which is absent in existing models. Further scaling to 270M parameters demonstrates even greater performance gains, highlighting the potential of BSM as a significant advancement in multimodal biological sequence modeling.
Related papers
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z) - scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers [14.254553622632594]
scFusion is a novel method for Single-Cell multimodal omics Fusion with TTT-based masked autoencoder.
We combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis.
arXiv Detail & Related papers (2024-10-17T06:29:29Z) - Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - Multi-modal Transfer Learning between Biological Foundation Models [2.6545450959042234]
We propose a multi-modal-specific model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality encoders.
We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods.
We open-source our model, paving the way for new multi-modal gene expression approaches.
arXiv Detail & Related papers (2024-06-20T09:44:53Z) - Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction [9.388979080270103]
We construct multimodal deep learning models to cover different molecular representations.
Compared with mono-modal models, our multimodal fused deep learning (MMFDL) models outperform single models in accuracy, reliability, and resistance capability against noise.
arXiv Detail & Related papers (2023-12-29T07:19:42Z) - Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs [27.32543389443672]
We present BioBridge, a novel parameter-efficient learning framework to bridge independently trained unimodal FMs to establish multimodal behavior.
Our empirical results demonstrate that BioBridge can beat the best baseline KG embedding methods.
We also identify BioBridge demonstrates out-of-domain generalization ability by extrapolating to unseen modalities or relations.
arXiv Detail & Related papers (2023-10-05T05:30:42Z) - Scaling Laws for Generative Mixed-Modal Language Models [103.25737824352949]
We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws.
We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities.
arXiv Detail & Related papers (2023-01-10T00:20:06Z) - Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in few seconds on commodity hardware, integrate with deep neural networks and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.