Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types
- URL: http://arxiv.org/abs/2110.05231v1
- Date: Mon, 11 Oct 2021 12:48:44 GMT
- Title: Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types
- Authors: Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru
Tang, Zhiqiang Shen, Eric P Xing, Yanyan Lan
- Abstract summary: We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
- Score: 75.65676405302105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In genome biology research, regulatory genome modeling is an important
topic for many regulatory downstream tasks, such as promoter classification and
transcription factor binding site prediction. The core problem is to model how
regulatory elements interact with each other and how they vary across
different cell types. However, current deep learning methods often focus on
modeling genome sequences of a fixed set of cell types and do not account for
the interaction between multiple regulatory elements, so they perform well only
on the cell types in the training set and lack the generalizability
required in biological applications. In this work, we propose a simple yet
effective approach for pre-training genome data in a multi-modal and
self-supervised manner, which we call GeneBERT. Specifically, we simultaneously
take the 1D sequence of genome data and a 2D matrix of (transcription factors x
regions) as input, and we propose three pre-training tasks to improve
the robustness and generalizability of our model. We pre-train our model on the
ATAC-seq dataset with 17 million genome sequences. We evaluate our GeneBERT on
regulatory downstream tasks across different cell types, including promoter
classification, transcription factor binding site prediction, disease risk
estimation, and splice site prediction. Extensive experiments demonstrate
the effectiveness of multi-modal and self-supervised pre-training for
large-scale regulatory genomics data.
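
The two input modalities named in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' pipeline: the abstract does not say how sequences are tokenized or how the (transcription factors x regions) matrix is filled, so this example uses overlapping k-mers for the 1D input and exact substring matching of made-up motifs for the 2D input. The names `kmer_tokens`, `tf_region_matrix`, `TF_A`, and `TF_B` are hypothetical.

```python
from typing import Dict, List


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """1D input (assumed form): overlapping k-mers of a DNA string."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tf_region_matrix(regions: List[str], motifs: Dict[str, str]) -> List[List[int]]:
    """2D input (assumed form): a (transcription factors x regions) 0/1 matrix.

    Row i corresponds to the i-th TF name in sorted order, column j to
    regions[j]. An entry is 1 when the TF's motif string occurs in the region.
    Exact substring matching is a simplification chosen for illustration only.
    """
    tf_names = sorted(motifs)
    return [
        [1 if motifs[tf] in region.upper() else 0 for region in regions]
        for tf in tf_names
    ]


# Hypothetical toy data: three short regions and two invented motifs.
regions = ["ACGTGACC", "TTGACGTA", "GGGCCCAA"]
motifs = {"TF_A": "GACG", "TF_B": "CCC"}

tokens = kmer_tokens(regions[0], k=3)       # 1D modality for the first region
matrix = tf_region_matrix(regions, motifs)  # 2D modality shared across regions
```

In this toy setup each region contributes one token sequence and one column of the matrix, which is one way to read "simultaneously take" both inputs; the model that consumes them is outside the scope of the sketch.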
Related papers
- scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers [14.254553622632594]
scFusionTTT is a novel method for Single-Cell multimodal omics Fusion with a TTT-based masked autoencoder.
We combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis.
arXiv Detail & Related papers (2024-10-17T06:29:29Z)
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Cell reprogramming design by transfer learning of functional transcriptional networks [0.0]
We develop a transfer learning approach to control cell behavior that is pre-trained on transcriptomic data associated with human cell fates.
We show that the number of gene perturbations required to steer from one fate to another increases with decreasing developmental relatedness.
arXiv Detail & Related papers (2024-03-07T19:00:02Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Granger causal inference on DAGs identifies genomic loci regulating transcription [77.58911272503771]
GrID-Net is a framework based on graph neural networks with lagged message passing for Granger causal inference on DAG-structured systems.
Our application is the analysis of single-cell multimodal data to identify genomic loci that mediate the regulation of specific genes.
arXiv Detail & Related papers (2022-10-18T21:15:10Z)
- Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state-of-the-art performance on 4 of 13 evaluation datasets from the ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, which involves identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)