GeneMask: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning
- URL: http://arxiv.org/abs/2307.15933v1
- Date: Sat, 29 Jul 2023 09:17:16 GMT
- Title: GeneMask: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning
- Authors: Soumyadeep Roy, Jonas Wallat, Sowmya S Sundaram, Wolfgang Nejdl, Niloy
Ganguly
- Abstract summary: We propose a novel masking algorithm, GeneMask, for masked language modeling (MLM) training of gene sequences.
We observe that GeneMask-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets.
We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs.
- Score: 18.24044777484094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language models such as DNABert and LOGO aim to learn optimal
gene representations and are trained on the entire Human Reference Genome.
However, standard tokenization schemes involve a simple sliding window of
tokens like k-mers that do not leverage any gene-based semantics and thus may
lead to (trivial) masking of easily predictable sequences and subsequently
inefficient Masked Language Modeling (MLM) training. Therefore, we propose a
novel masking algorithm, GeneMask, for MLM training of gene sequences, where we
randomly identify positions in a gene sequence as mask centers and locally
select the span around the mask center with the highest Normalized Pointwise
Mutual Information (NPMI) to mask. We observe that in the absence of
human-understandable semantics in the genomics domain (in contrast, semantic
units like words and phrases are inherently available in NLP), GeneMask-based
models substantially outperform the SOTA models (DNABert and LOGO) over four
benchmark gene sequence classification datasets in five few-shot settings (10
to 1000-shot). More significantly, the GeneMask-based DNABert model is trained
for less than one-tenth of the number of epochs of the original SOTA model. We
also observe a strong correlation between top-ranked PMI tokens and conserved
DNA sequence motifs, which may indicate the incorporation of latent genomic
information. The code (including trained models) and datasets are made
publicly available at https://github.com/roysoumya/GeneMask.
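To make the masking procedure concrete, the following is a minimal Python sketch (not the authors' released implementation) of sliding-window k-mer tokenization followed by GeneMask-style span selection: random mask centers are drawn, and the span around each center with the highest Normalized Pointwise Mutual Information (NPMI) is chosen for masking. The k-mer size, span width, adjacent-pair NPMI scoring, and all function names are illustrative assumptions; the linked repository contains the actual algorithm.

```python
# Minimal sketch of GeneMask-style masking (illustrative only; see the official
# repository for the real implementation). Uses adjacent-pair NPMI as the span score.
import math
import random
from collections import Counter

def kmer_tokenize(seq: str, k: int = 6) -> list:
    """Sliding-window k-mer tokenization (stride 1), as in DNABert-style models."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def npmi_table(tokenized_corpus: list) -> dict:
    """Estimate NPMI for adjacent k-mer pairs from a corpus of tokenized sequences."""
    unigrams, bigrams = Counter(), Counter()
    n_uni = n_bi = 0
    for toks in tokenized_corpus:
        unigrams.update(toks)
        n_uni += len(toks)
        pairs = list(zip(toks, toks[1:]))
        bigrams.update(pairs)
        n_bi += len(pairs)
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        pmi = math.log(p_ab / (p_a * p_b))
        table[(a, b)] = pmi / max(-math.log(p_ab), 1e-9)  # normalize to roughly [-1, 1]
    return table

def genemask_spans(tokens: list, table: dict, span: int = 5,
                   n_centers: int = 3, seed: int = 0) -> list:
    """Draw random mask centers; around each, pick the span-token window with the
    highest mean adjacent-pair NPMI and mark its token indices for masking."""
    rng = random.Random(seed)
    to_mask = set()
    for _ in range(n_centers):
        center = rng.randrange(len(tokens))
        best_start, best_score = None, float("-inf")
        for start in range(max(0, center - span + 1),
                           min(center, len(tokens) - span) + 1):
            window = tokens[start:start + span]
            score = sum(table.get(pair, -1.0)   # unseen pairs get the floor score
                        for pair in zip(window, window[1:])) / (span - 1)
            if score > best_score:
                best_start, best_score = start, score
        if best_start is not None:
            to_mask.update(range(best_start, best_start + span))
    return sorted(to_mask)
```

On toy input, `tokens = kmer_tokenize("ACGTACGTACGTACGTACGTACGT")`, `table = npmi_table([tokens])`, and `genemask_spans(tokens, table)` return the token indices to replace with the mask token; in the paper the PMI statistics are computed over the pretraining corpus rather than a single sequence.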
Related papers
- Long-range gene expression prediction with token alignment of large language model [37.10820914895689]
We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens.
GTA learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts.
GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model.
arXiv Detail & Related papers (2024-10-02T02:42:29Z)
- Unlocking Efficiency: Adaptive Masking for Gene Transformer Models [19.699485326192846]
Gene transformer models such as Nucleotide Transformer, DNABert, and LOGO are trained to learn optimal gene sequence representations.
Gene sequences lack well-defined semantic units analogous to the words or sentences of the NLP domain.
Our proposed Curriculum Masking-based Gene Masking Strategy (CM-GEMS) demonstrates superior representation learning capabilities.
arXiv Detail & Related papers (2024-08-13T19:45:02Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
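As a rough illustration of the codebook-tokenization idea summarized above (not the VQDNA implementation), the sketch below snaps fixed-length DNA fragments to their nearest entry in a vector-quantized codebook; the encoder and codebook are random stand-ins for components that VQDNA learns end-to-end, and the fragment length and dimensions are arbitrary.

```python
# Illustrative vector-quantized tokenization of DNA (random stand-ins for
# learned components; not the VQDNA code).
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(512, 16))   # 512 "vocabulary" vectors of dimension 16
PROJECTION = rng.normal(size=(4, 16))   # stand-in for a learned sequence encoder

def embed_fragment(fragment: str) -> np.ndarray:
    """One-hot encode A/C/G/T and project to the codebook dimension."""
    onehot = np.eye(4)[["ACGT".index(base) for base in fragment]]  # (len, 4); assumes no ambiguous bases
    return onehot.mean(axis=0) @ PROJECTION                        # (16,)

def vq_tokenize(seq: str, frag_len: int = 8) -> list:
    """Map non-overlapping fragments to the index of their nearest codebook vector."""
    token_ids = []
    for start in range(0, len(seq) - frag_len + 1, frag_len):
        z = embed_fragment(seq[start:start + frag_len])
        distances = np.linalg.norm(CODEBOOK - z, axis=1)
        token_ids.append(int(np.argmin(distances)))
    return token_ids

print(vq_tokenize("ACGTACGTACGTACGTACGTACGTACGTACGT"))  # prints four codebook indices
```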
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) results on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning (BC).
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state-of-the-art performance on 4 of 13 evaluation datasets from the ENCODE-DREAM benchmarks and ranks 3rd overall on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference (LAI) identifies the ancestry of each segment of an individual's genome.
We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture.
We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.