BarcodeMamba: State Space Models for Biodiversity Analysis
- URL: http://arxiv.org/abs/2412.11084v1
- Date: Sun, 15 Dec 2024 06:52:18 GMT
- Title: BarcodeMamba: State Space Models for Biodiversity Analysis
- Authors: Tiancheng Gao, Graham W. Taylor
- Abstract summary: BarcodeMamba is a performant and efficient foundation model for DNA barcodes in biodiversity analysis.
Our study shows that BarcodeMamba has better performance than BarcodeBERT even when using only 8.3% as many parameters.
In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species.
- Score: 14.524535359259414
- Abstract: DNA barcodes are crucial in biodiversity analysis for building automatic identification systems that recognize known species and discover unseen species. Unlike human genome modeling, barcode-based invertebrate identification poses challenges in the vast diversity of species and taxonomic complexity. Among Transformer-based foundation models, BarcodeBERT excelled in species-level identification of invertebrates, highlighting the effectiveness of self-supervised pretraining on barcode-specific datasets. Recently, structured state space models (SSMs) have emerged, with a time complexity that scales sub-quadratically with the context length. SSMs provide an efficient parameterization of sequence modeling relative to attention-based architectures. Given the success of Mamba and Mamba-2 in natural language, we designed BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis. We conducted a comprehensive ablation study on the impacts of self-supervised training and tokenization methods, and compared both versions of Mamba layers in terms of expressiveness and their capacity to identify "unseen" species held back from training. Our study shows that BarcodeMamba has better performance than BarcodeBERT even when using only 8.3% as many parameters, and improves accuracy to 99.2% on species-level accuracy in linear probing without fine-tuning for "seen" species. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species. The code repository to reproduce our experiments is available at https://github.com/bioscan-ml/BarcodeMamba.
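The 1-nearest neighbor (1-NN) probing mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each barcode has already been mapped to a fixed-length embedding by a frozen encoder, that similarity is measured by cosine similarity, and that each query is assigned the genus of its closest reference embedding. The vectors, genus names, and function names below are hypothetical toy values, not outputs or APIs of BarcodeMamba, and the distance metric used in the paper may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def one_nn_predict(query, ref_embeddings, ref_labels):
    """Return the label of the reference embedding most similar to the query."""
    best_index = max(
        range(len(ref_embeddings)),
        key=lambda i: cosine_similarity(query, ref_embeddings[i]),
    )
    return ref_labels[best_index]


def one_nn_accuracy(queries, query_labels, ref_embeddings, ref_labels):
    """Fraction of queries whose nearest reference carries the correct label."""
    correct = sum(
        one_nn_predict(q, ref_embeddings, ref_labels) == y
        for q, y in zip(queries, query_labels)
    )
    return correct / len(queries)


# Hypothetical toy embeddings standing in for frozen-encoder outputs.
# Reference set: barcodes of species seen during training, labeled by genus.
ref_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.3],
]
ref_genera = ["GenusA", "GenusA", "GenusB", "GenusB"]

# Query set: barcodes of species held out from training, with known genera.
query_embeddings = [
    [0.8, 0.3, 0.0],
    [0.2, 1.0, 0.1],
    [0.5, 0.6, 0.9],
]
query_genera = ["GenusA", "GenusB", "GenusA"]

accuracy = one_nn_accuracy(query_embeddings, query_genera, ref_embeddings, ref_genera)
print(f"genus-level 1-NN accuracy: {accuracy:.3f}")
```

Because the probe involves no trainable parameters, its accuracy reflects only how well the frozen embeddings cluster by genus, which is why it is a common way to evaluate species that were held out from training.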
Related papers
- Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision.
In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z) - The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources.
The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks.
We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z) - Mamba-Spike: Enhancing the Mamba Architecture with a Spiking Front-End for Efficient Temporal Data Processing [4.673285689826945]
Mamba-Spike is a novel neuromorphic architecture that integrates a spiking front-end with the Mamba backbone to achieve efficient temporal data processing.
The architecture consistently outperforms state-of-the-art baselines, achieving higher accuracy, lower latency, and improved energy efficiency.
arXiv Detail & Related papers (2024-08-04T14:10:33Z) - BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity [19.003642885871546]
BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens.
We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy.
arXiv Detail & Related papers (2024-06-18T15:45:21Z) - An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale [21.995678534789615]
We use contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space.
Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks.
arXiv Detail & Related papers (2024-05-27T17:57:48Z) - MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection [53.03687787922032]
Mamba-based models with superior long-range modeling and linear efficiency have garnered substantial attention.
MambaAD consists of a pre-trained encoder and a Mamba decoder featuring Locality-Enhanced State Space (LSS) modules at multiple scales.
The proposed LSS module, integrating parallel cascaded Hybrid State Space (HSS) blocks and multi-kernel convolution operations, effectively captures both long-range and local information.
arXiv Detail & Related papers (2024-04-09T18:28:55Z) - MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z) - BarcodeBERT: Transformers for Biodiversity Analysis [18.582770076266737]
We introduce BarcodeBERT, a family of models tailored to biodiversity analysis.
BarcodeBERT is trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes.
arXiv Detail & Related papers (2023-11-04T13:25:49Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Towards ML Methods for Biodiversity: A Novel Wild Bee Dataset and Evaluations of XAI Methods for ML-Assisted Rare Species Annotations [3.947933139348889]
Insects are a crucial part of our ecosystem. Sadly, in the past few decades, their numbers have worryingly decreased.
In an attempt to gain a better understanding of this process and monitor insect populations, Deep Learning may offer viable solutions.
This paper presents a dataset of thoroughly annotated images of wild bees sampled from the iNaturalist database.
A ResNet model trained on the wild bee dataset achieves classification scores comparable to similar state-of-the-art models trained on other fine-grained datasets.
arXiv Detail & Related papers (2022-06-15T12:48:05Z)