DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
- URL: http://arxiv.org/abs/2306.15006v2
- Date: Mon, 18 Mar 2024 23:59:29 GMT
- Title: DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
- Authors: Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, Han Liu
- Abstract summary: We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
- Score: 10.051595222470304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates 36 distinct datasets across 9 tasks, with input lengths ranging from 70 to 10000. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with 21× fewer parameters and approximately 92× less GPU time in pre-training.
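To make the abstract's tokenization argument concrete, below is a minimal Python sketch contrasting overlapping k-mer tokenization with BPE's iterative pair merging. This is generic BPE on a toy corpus, not DNABERT-2's actual tokenizer; the function names, corpus, and merge count are illustrative assumptions.

```python
from collections import Counter

def kmer_tokenize(seq, k=3):
    """Overlapping k-mer tokenization: each base appears in up to k tokens,
    so a length-L sequence yields L - k + 1 largely redundant tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_train(corpus, num_merges=4):
    """Toy BPE: start from single bases and iteratively merge the most
    frequent adjacent token pair across the corpus (non-overlapping output)."""
    tokenized = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokenized:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge greedily, left to right, in every sequence.
        for i, toks in enumerate(tokenized):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            tokenized[i] = out
    return merges, tokenized

corpus = ["ATGATGCGTA", "CGTATGATG"]
print(kmer_tokenize(corpus[0]))            # overlapping: ['ATG', 'TGA', 'GAT', ...]
merges, tokens = bpe_train(corpus)
print(merges)                              # learned vocabulary entries, e.g. ['AT', ...]
print(tokens)                              # non-overlapping segmentation of each sequence
```

The k-mer output illustrates the inefficiency the abstract argues against: every base is re-encoded in up to k tokens, while the BPE segmentation covers each base exactly once with corpus-frequency-driven tokens.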
Related papers
- dnaGrinder: a lightweight and high-capacity genomic foundation model [11.646351318648499]
Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance.
We introduce dnaGrinder, a unique and efficient genomic foundation model.
dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance.
arXiv Detail & Related papers (2024-09-24T03:20:07Z)
- Unlocking Efficiency: Adaptive Masking for Gene Transformer Models [19.699485326192846]
Gene transformer models such as Nucleotide Transformer, DNABERT, and LOGO are trained to learn optimal gene sequence representations.
Gene sequences lack well-defined semantic units akin to the words or sentences of the NLP domain.
Our proposed Curriculum Masking-based Gene Masking Strategy (CM-GEMS) demonstrates superior representation learning capabilities.
arXiv Detail & Related papers (2024-08-13T19:45:02Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings (see the vector-quantization sketch after this list).
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach the performance of expert methods on some tasks, but capture only limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study [48.75445626157713]
SNP2Vec is a scalable self-supervised pre-training approach for understanding SNPs (single-nucleotide polymorphisms).
We apply SNP2Vec to perform long-sequence genomics modeling.
We evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort.
arXiv Detail & Related papers (2022-04-14T01:53:58Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Brain Image Synthesis with Unsupervised Multivariate Canonical CSC$\ell_4$Net [122.8907826672382]
We propose to learn dedicated features that cross both inter- and intra-modal variations using a novel CSC$\ell_4$Net.
arXiv Detail & Related papers (2021-03-22T05:19:40Z)
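As promised in the VQDNA entry above, the core idea of codebook-based tokenization can be illustrated with a generic vector-quantization step in the style of VQ-VAE models. This is a minimal sketch under assumed toy shapes; the array names and dimensions are illustrative assumptions, not VQDNA's published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 512 learnable codebook vectors of width 64.
codebook = rng.normal(size=(512, 64))      # learnable "vocabulary"
encoder_out = rng.normal(size=(100, 64))   # per-position genome embeddings

# Vector quantization: replace each embedding with its nearest codebook
# entry; the entry indices act as learned, pattern-aware token ids.
dists = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
token_ids = dists.argmin(axis=1)           # (100,) discrete token ids
quantized = codebook[token_ids]            # (100, 64) embeddings fed onward

print(token_ids[:10])
```

In contrast to a fixed k-mer or BPE vocabulary, the codebook here would be trained jointly with the encoder, which is what lets such a tokenizer adapt to genome-specific patterns.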
This list is automatically generated from the titles and abstracts of the papers on this site.