Epigenomic language models powered by Cerebras
- URL: http://arxiv.org/abs/2112.07571v1
- Date: Tue, 14 Dec 2021 17:23:42 GMT
- Title: Epigenomic language models powered by Cerebras
- Authors: Meredith V. Trotter, Cuong Q. Nguyen, Stephen Young, Rob T. Woodruff,
Kim M. Branson
- Abstract summary: Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large scale self-supervised pre-training of Transformer language models has
advanced the field of Natural Language Processing and shown promise in
cross-application to the biological `languages' of proteins and DNA. Learning
effective representations of DNA sequences using large genomic sequence
corpuses may accelerate the development of models of gene regulation and
function through transfer learning. However, to accurately model cell
type-specific gene regulation and function, it is necessary to consider not
only the information contained in DNA nucleotide sequences, which is mostly
invariant between cell types, but also how the local chemical and structural
`epigenetic state' of chromosomes varies between cell types. Here, we introduce
a Bidirectional Encoder Representations from Transformers (BERT) model that
learns representations based on both DNA sequence and paired epigenetic state
inputs, which we call Epigenomic BERT (or EBERT). We pre-train EBERT with a
masked language model objective across the entire human genome and across 127
cell types. Training this complex model with a previously prohibitively large
dataset was made possible for the first time by a partnership with Cerebras
Systems, whose CS-1 system powered all pre-training experiments. We show
EBERT's transfer learning potential by demonstrating strong performance on a
cell type-specific transcription factor binding prediction task. Our fine-tuned
model exceeds state of the art performance on 4 of 13 evaluation datasets from
ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge
leaderboard. We explore how the inclusion of epigenetic data and task specific
feature augmentation impact transfer learning performance.
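The modelling idea described above can be sketched compactly: each genomic position contributes both a DNA token and a paired epigenetic-state token, the two are embedded and combined, and a Transformer encoder is trained with a masked language model objective to recover masked DNA tokens. The snippet below is a minimal, hypothetical PyTorch illustration of that setup; the vocabulary sizes, k-mer tokenization, 15% masking rate, and summing of the two embeddings are assumptions for illustration, not EBERT's actual configuration.

```python
# Hypothetical sketch of a paired DNA-sequence / epigenetic-state masked LM,
# loosely following the EBERT description above. Vocabulary sizes, the k-mer
# scheme, and the embedding-sum fusion are assumptions, not the paper's setup.
import torch
import torch.nn as nn


class PairedInputMaskedLM(nn.Module):
    def __init__(self, dna_vocab=4 ** 6 + 4,   # all 6-mers plus a few special tokens (assumption)
                 epi_states=16, d_model=256, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        # Separate embeddings for DNA k-mer tokens and discrete epigenetic-state
        # tokens, summed into one per-position representation (one plausible fusion).
        self.dna_emb = nn.Embedding(dna_vocab, d_model)
        self.epi_emb = nn.Embedding(epi_states, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, dna_vocab)  # predict masked DNA tokens

    def forward(self, dna_tokens, epi_tokens):
        pos = torch.arange(dna_tokens.size(1), device=dna_tokens.device)
        x = self.dna_emb(dna_tokens) + self.epi_emb(epi_tokens) + self.pos_emb(pos)
        h = self.encoder(x)
        return self.mlm_head(h)  # (batch, seq_len, dna_vocab) logits


# Illustrative masked-LM training step on random stand-in data.
model = PairedInputMaskedLM()
dna = torch.randint(0, 4 ** 6, (2, 128))     # fake k-mer token ids
epi = torch.randint(0, 16, (2, 128))         # fake epigenetic-state ids
labels = dna.clone()
mask = torch.rand(dna.shape) < 0.15          # mask ~15% of positions (assumption)
dna_masked = dna.masked_fill(mask, 4 ** 6)   # reserved [MASK] id (assumption)
logits = model(dna_masked, epi)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
```

For the fine-tuning stage described in the abstract, the pre-trained encoder's representations would feed a task-specific head for cell type-specific transcription factor binding prediction; that head is omitted from this sketch.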
Related papers
- Long-range gene expression prediction with token alignment of large language model [37.10820914895689]
We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens.
GTA learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts.
GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model.
arXiv Detail & Related papers (2024-10-02T02:42:29Z) - Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Multi-modal Transfer Learning between Biological Foundation Models [2.6545450959042234]
We propose a multi-modal-specific model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality encoders.
We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods.
We open-source our model, paving the way for new multi-modal gene expression approaches.
arXiv Detail & Related papers (2024-06-20T09:44:53Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present LINGO: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, LINGO strategically leverages natural language foundation models' contextual cues.
LINGO further accommodates numerous downstream fine-tuning tasks through an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - A single-cell gene expression language model [2.9112649816695213]
We propose a machine learning system to learn context dependencies between genes.
Our model, Exceiver, is trained across a diversity of cell types using a self-supervised task.
We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations.
arXiv Detail & Related papers (2022-10-25T20:52:19Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, the Prototypical Network, a simple yet effective meta-learning method for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
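As referenced in the VQDNA entry above, vector-quantized tokenization maps continuous sequence features to the nearest vector in a learnable codebook, and the codebook indices act as a learned vocabulary. The following is a minimal, generic sketch of that idea; the codebook size, feature dimension, and straight-through gradient trick follow standard VQ-VAE practice and are assumptions here, not details taken from the VQDNA paper.

```python
# Generic sketch of vector-quantized tokenization: continuous sequence features
# are snapped to their nearest codebook vector, and the codebook indices serve
# as learned "tokens". Dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, features):
        # features: (batch, seq_len, code_dim) continuous embeddings of genome windows
        diff = features.unsqueeze(-2) - self.codebook.weight  # (batch, seq, num_codes, dim)
        dists = (diff ** 2).sum(-1)                           # squared L2 distances
        codes = dists.argmin(dim=-1)                          # discrete token ids
        quantized = self.codebook(codes)                      # pattern-aware embeddings
        # Straight-through estimator so gradients flow back to the encoder features.
        quantized = features + (quantized - features).detach()
        return codes, quantized


vq = VectorQuantizer()
feats = torch.randn(1, 100, 64)            # stand-in for encoded DNA windows
token_ids, embeddings = vq(feats)
print(token_ids.shape, embeddings.shape)   # torch.Size([1, 100]) torch.Size([1, 100, 64])
```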
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.