BarcodeBERT: Transformers for Biodiversity Analysis
- URL: http://arxiv.org/abs/2311.02401v1
- Date: Sat, 4 Nov 2023 13:25:49 GMT
- Title: BarcodeBERT: Transformers for Biodiversity Analysis
- Authors: Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Scott C. Lowe, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Graham W. Taylor
- Abstract summary: We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis.
BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks.
- Score: 19.082058886309028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes -
short snippets of DNA that cluster by species - play a pivotal role. In
particular, invertebrates, a highly diverse and under-explored group, pose
unique taxonomic complexities. We explore machine learning approaches,
comparing supervised CNNs, fine-tuned foundation models, and a DNA
barcode-specific masking strategy across datasets of varying complexity. While
simpler datasets and tasks favor supervised CNNs or fine-tuned transformers,
challenging species-level identification demands a paradigm shift towards
self-supervised pretraining. We propose BarcodeBERT, the first self-supervised
method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA
barcode reference library. This work highlights how dataset specifics and
coverage impact model selection, and underscores the role of self-supervised
pretraining in achieving high-accuracy DNA barcode-based identification at the
species and genus level. Indeed, without the fine-tuning step, BarcodeBERT
pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on
multiple downstream classification tasks. The code repository is available at
https://github.com/Kari-Genomics-Lab/BarcodeBERT
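The abstract mentions a DNA barcode-specific masking strategy but does not describe it. As a rough illustration only, the sketch below shows what BERT-style masked pretraining input for a barcode could look like, assuming non-overlapping k-mer tokens and a uniformly random mask. The function names `kmer_tokenize` and `mask_tokens`, the choice `k=4`, and the 50% mask rate are hypothetical and are not taken from the paper or its repository.

```python
import random

MASK = "[MASK]"  # placeholder mask token; an assumption for this sketch


def kmer_tokenize(seq, k=4):
    """Split a DNA string into non-overlapping k-mers, dropping any trailing remainder."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.5, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (masked, labels), where labels holds the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_rate)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    tokens = kmer_tokenize("ACGTACGTTTGACCAGT", k=4)
    masked, labels = mask_tokens(tokens, mask_rate=0.5)
    print(tokens)  # the original k-mer tokens
    print(masked)  # the same tokens with half replaced by MASK
```

A model pretrained this way would be asked to predict the entries of `labels` that are not `None` from the surrounding unmasked tokens; the actual tokenization and masking used by BarcodeBERT should be checked against the linked repository.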
Related papers
- Improving Taxonomic Image-based Out-of-distribution Detection With DNA Barcodes [6.1593136743688355]
We study whether DNA barcodes can also help identify outlier images, based on the similarity of the outlier DNA sequence to the seen classes.
We experimentally show that the proposed approach improves taxonomic OOD detection compared to all common baselines.
arXiv Detail & Related papers (2024-06-27T08:39:16Z)
- BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity [19.003642885871546]
BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens.
We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy.
arXiv Detail & Related papers (2024-06-18T15:45:21Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings [7.822348354050447]
We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species.
Results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios.
arXiv Detail & Related papers (2024-02-13T20:21:29Z)
- BEND: Benchmarking DNA Language Models on biologically meaningful tasks [7.005668635562045]
We introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks.
We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
arXiv Detail & Related papers (2023-11-21T12:34:00Z)
- Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task.
A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space.
DNA-ESA is 99% accurate when aligning 250-length reads onto a human genome (3gb), rivaling conventional methods such as Bowtie and BWA-Mem.
arXiv Detail & Related papers (2023-09-20T06:30:39Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for high-performance image-based person Re-ID.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Semi-supervised deep learning based on label propagation in a 2D embedded space [117.9296191012968]
Proposed solutions propagate labels from a small set of supervised images to a large set of unsupervised ones to train a deep neural network model.
We present a loop in which a deep neural network (VGG-16) is trained on a set that gains more correctly labeled samples with each iteration.
As the labeled set improves over iterations, so do the features learned by the neural network.
arXiv Detail & Related papers (2020-08-02T20:08:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.