BarcodeMamba: State Space Models for Biodiversity Analysis
- URL: http://arxiv.org/abs/2412.11084v1
- Date: Sun, 15 Dec 2024 06:52:18 GMT
- Title: BarcodeMamba: State Space Models for Biodiversity Analysis
- Authors: Tiancheng Gao, Graham W. Taylor
- Abstract summary: BarcodeMamba is a performant and efficient foundation model for DNA barcodes in biodiversity analysis. Our study shows that BarcodeMamba has better performance than BarcodeBERT even when using only 8.3% as many parameters. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DNA barcodes are crucial in biodiversity analysis for building automatic identification systems that recognize known species and discover unseen species. Unlike human genome modeling, barcode-based invertebrate identification poses challenges in the vast diversity of species and taxonomic complexity. Among Transformer-based foundation models, BarcodeBERT excelled in species-level identification of invertebrates, highlighting the effectiveness of self-supervised pretraining on barcode-specific datasets. Recently, structured state space models (SSMs) have emerged, with a time complexity that scales sub-quadratically with the context length. SSMs provide an efficient parameterization of sequence modeling relative to attention-based architectures. Given the success of Mamba and Mamba-2 in natural language, we designed BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis. We conducted a comprehensive ablation study on the impacts of self-supervised training and tokenization methods, and compared both versions of Mamba layers in terms of expressiveness and their capacity to identify "unseen" species held back from training. Our study shows that BarcodeMamba outperforms BarcodeBERT even when using only 8.3% as many parameters, and improves species-level accuracy for "seen" species to 99.2% in linear probing without fine-tuning. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species. The code repository to reproduce our experiments is available at https://github.com/bioscan-ml/BarcodeMamba.
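The 1-NN probing protocol mentioned in the abstract can be sketched as follows: a frozen encoder produces embeddings, and each query barcode is assigned the label of its nearest training embedding, with no fine-tuning. This is a minimal illustrative sketch using NumPy and toy data, not BarcodeMamba's actual code; the function name, labels, and embeddings are hypothetical.

```python
# Minimal sketch of 1-NN probing on frozen embeddings (illustrative only).
import numpy as np

def one_nn_probe(train_emb, train_labels, test_emb):
    """Assign each test embedding the label of its nearest training
    embedding under cosine similarity; the encoder is never fine-tuned."""
    # L2-normalize so a dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    nearest = (test @ train.T).argmax(axis=1)  # index of nearest neighbor
    return train_labels[nearest]

# Toy usage: four "seen" barcode embeddings with genus labels,
# two queries generated as slightly perturbed copies of two of them.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(4, 8))
train_labels = np.array(["Bombus", "Bombus", "Apis", "Apis"])
test_emb = train_emb[[1, 3]] + 0.01 * rng.normal(size=(2, 8))
print(one_nn_probe(train_emb, train_labels, test_emb))  # → ['Bombus' 'Apis']
```

For genus-level probing on unseen species, the same procedure applies with genus labels on the training set; accuracy is then the fraction of queries whose nearest neighbor shares their genus.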
Related papers
- Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity [0.39945675027960637]
We introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling.
GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines.
We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness.
arXiv Detail & Related papers (2025-04-22T20:34:47Z) - TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba [88.31117598044725]
We explore cross-architecture training to transfer the ready knowledge in existing Transformer models to alternative architecture Mamba, termed TransMamba.
Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness across both uni-modal and cross-modal tasks.
For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of Mamba architecture.
arXiv Detail & Related papers (2025-02-21T01:22:01Z) - Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision.
In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z) - BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity [19.003642885871546]
BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens.
We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy.
arXiv Detail & Related papers (2024-06-18T15:45:21Z) - An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-context Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale [21.995678534789615]
We use contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space.
Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks.
arXiv Detail & Related papers (2024-05-27T17:57:48Z) - MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection [53.03687787922032]
Mamba-based models with superior long-range modeling and linear efficiency have garnered substantial attention.
MambaAD consists of a pre-trained encoder and a Mamba decoder featuring Locality-Enhanced State Space (LSS) modules at multiple scales.
The proposed LSS module, integrating parallel cascaded Hybrid State Space (HSS) blocks and multi-kernel convolution operations, effectively captures both long-range and local information.
arXiv Detail & Related papers (2024-04-09T18:28:55Z) - MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z) - BarcodeBERT: Transformers for Biodiversity Analysis [19.082058886309028]
We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis.
BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks.
arXiv Detail & Related papers (2023-11-04T13:25:49Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D-CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Towards ML Methods for Biodiversity: A Novel Wild Bee Dataset and Evaluations of XAI Methods for ML-Assisted Rare Species Annotations [3.947933139348889]
Insects are a crucial part of our ecosystem. Sadly, in the past few decades, their numbers have worryingly decreased.
In an attempt to gain a better understanding of this process and monitor insect populations, Deep Learning may offer viable solutions.
This paper presents a dataset of thoroughly annotated images of wild bees sampled from the iNaturalist database.
A ResNet model trained on the wild bee dataset achieves classification scores comparable to similar state-of-the-art models trained on other fine-grained datasets.
arXiv Detail & Related papers (2022-06-15T12:48:05Z)