DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation
Models
- URL: http://arxiv.org/abs/2402.08777v2
- Date: Thu, 15 Feb 2024 04:55:23 GMT
- Authors: Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V
Davuluri, Zhong Wang, Han Liu
- Abstract summary: We introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings.
We introduce MI-Mix, a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer.
Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance.
- Score: 8.159258510270243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective DNA embedding remains crucial in genomic analysis, particularly in
scenarios lacking labeled data for model fine-tuning, despite the significant
advancements in genome foundation models. A prime example is metagenomics
binning, a critical process in microbiome research that aims to group DNA
sequences by their species from a complex mixture of DNA sequences derived from
potentially thousands of distinct, often uncharacterized species. To address
the lack of effective DNA embedding models, we introduce DNABERT-S, a genome
foundation model that specializes in creating species-aware DNA embeddings. To
encourage effective embeddings for error-prone long-read DNA sequences, we
introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes
the hidden representations of DNA sequences at randomly selected layers and
trains the model to recognize and differentiate these mixed proportions at the
output layer. We further enhance it with the proposed Curriculum Contrastive
Learning (C$^2$LR) strategy. Empirical results on 18 diverse datasets showed
DNABERT-S's remarkable performance. With only 2-shot training, it surpasses
the top baseline's 10-shot species classification performance, while doubling
the Adjusted Rand Index (ARI) in species clustering and substantially
increasing the number of correctly identified species in metagenomics binning.
The code, data, and pre-trained model are publicly available at
https://github.com/Zhihan1996/DNABERT_S.
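The core MI-Mix operation described above (blending hidden representations of two DNA sequences at a randomly selected layer, with the mixing proportion serving as the training target) can be sketched in a few lines. This is a minimal illustration under assumed details, not the authors' implementation: the function name, the Beta distribution parameters, and the toy tensor shapes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_mix(hidden_a, hidden_b):
    """Blend two sequences' per-layer hidden states at one randomly
    selected layer. The mixing proportion `lam` is what the model is
    trained to recover at the output layer. (Illustrative sketch only;
    the Beta(2, 2) choice is an assumption.)"""
    num_layers = hidden_a.shape[0]
    lam = rng.beta(2.0, 2.0)          # mixing proportion in (0, 1)
    layer = rng.integers(num_layers)  # randomly selected layer
    mixed = lam * hidden_a[layer] + (1.0 - lam) * hidden_b[layer]
    return mixed, lam, layer

# Toy per-layer hidden states with shape (num_layers, hidden_dim)
h_a = rng.normal(size=(4, 8))
h_b = rng.normal(size=(4, 8))
mixed, lam, layer = mi_mix(h_a, h_b)
```

In the actual model, `mixed` would be propagated through the remaining transformer layers and a head would predict `lam`, making the contrastive objective robust to the kind of noise found in long-read sequencing data.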
Related papers
- A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language [3.384797724820242]
Predicting gene function from its DNA sequence is a fundamental challenge in biology.
Deep learning models have been proposed to embed DNA sequences and predict their enzymatic function.
Much of the scientific community's knowledge of biological function is not represented in categorical labels.
arXiv Detail & Related papers (2024-07-21T19:27:43Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks through an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - DiscDiff: Latent Diffusion Model for DNA Sequence Generation [4.946462450157714]
We introduce DiscDiff, a Latent Diffusion Model tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences.
EPD-GenDNA is the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species.
We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
arXiv Detail & Related papers (2024-02-08T22:06:55Z) - BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z) - Fast and Functional Structured Data Generators Rooted in
Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence
Analysis Tasks [14.931476374660944]
DNAGPT is a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals.
By enhancing the classic GPT model with a binary classification task, a numerical regression task, and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks.
arXiv Detail & Related papers (2023-07-11T06:30:43Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference identifies the ancestry of each segment of an individual's genome.
We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture.
We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.