DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation
Models
- URL: http://arxiv.org/abs/2402.08777v2
- Date: Thu, 15 Feb 2024 04:55:23 GMT
- Authors: Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V
Davuluri, Zhong Wang, Han Liu
- Abstract summary: We introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings.
We introduce MI-Mix, a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer.
Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance.
- Score: 8.159258510270243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective DNA embedding remains crucial in genomic analysis, particularly in
scenarios lacking labeled data for model fine-tuning, despite the significant
advancements in genome foundation models. A prime example is metagenomics
binning, a critical process in microbiome research that aims to group DNA
sequences by their species from a complex mixture of DNA sequences derived from
potentially thousands of distinct, often uncharacterized species. To address
the lack of effective DNA embedding models, we introduce DNABERT-S, a genome
foundation model that specializes in creating species-aware DNA embeddings. To
encourage effective embeddings for error-prone long-read DNA sequences, we
introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes
the hidden representations of DNA sequences at randomly selected layers and
trains the model to recognize and differentiate these mixed proportions at the
output layer. We further enhance it with the proposed Curriculum Contrastive
Learning (C$^2$LR) strategy. Empirical results on 18 diverse datasets showed
DNABERT-S's remarkable performance. With only 2-shot training, it surpasses
the top baseline's 10-shot species classification performance, while doubling
the Adjusted Rand Index (ARI) in species clustering and substantially
increasing the number of correctly identified species in metagenomics binning.
The code, data, and pre-trained model are publicly available at
https://github.com/Zhihan1996/DNABERT_S.
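The core MI-Mix operation described above (blending hidden representations of two DNA sequences at a randomly selected layer, with the mixing proportion serving as the training target) can be sketched in a few lines. This is a minimal illustration under assumed details, not the authors' implementation: the function name, the Beta distribution parameters, and the toy tensor shapes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_mix(hidden_a, hidden_b):
    """Blend two sequences' per-layer hidden states at one randomly
    selected layer. The mixing proportion `lam` is what the model is
    trained to recover at the output layer. (Illustrative sketch only;
    the Beta(2, 2) choice is an assumption.)"""
    num_layers = hidden_a.shape[0]
    lam = rng.beta(2.0, 2.0)          # mixing proportion in (0, 1)
    layer = rng.integers(num_layers)  # randomly selected layer
    mixed = lam * hidden_a[layer] + (1.0 - lam) * hidden_b[layer]
    return mixed, lam, layer

# Toy per-layer hidden states with shape (num_layers, hidden_dim)
h_a = rng.normal(size=(4, 8))
h_b = rng.normal(size=(4, 8))
mixed, lam, layer = mi_mix(h_a, h_b)
```

In the actual model, `mixed` would be propagated through the remaining transformer layers and a head would predict `lam`, making the contrastive objective robust to the kind of noise found in long-read sequencing data.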
Related papers
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks through an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - DiscDiff: Latent Diffusion Model for DNA Sequence Generation [4.946462450157714]
We introduce DiscDiff, a Latent Diffusion Model tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences.
EPD-GenDNA is the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species.
We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
arXiv Detail & Related papers (2024-02-08T22:06:55Z) - BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z) - Fast and Functional Structured Data Generators Rooted in
Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference identifies the ancestry of each segment of an individual's genome.
We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture.
We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.