BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research
- URL: http://arxiv.org/abs/2512.15931v1
- Date: Wed, 17 Dec 2025 19:56:03 GMT
- Title: BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research
- Authors: Tiancheng Gao, Scott C. Lowe, Brendan Furneaux, Angel X Chang, Graham W. Taylor
- Abstract summary: We introduce a foundation model for fungal barcode classification built on a powerful and efficient state-space model architecture. We demonstrate this is substantially more effective than traditional fully-supervised methods in this data-sparse environment. Our work provides a powerful new tool for genomics-based biodiversity research.
- Score: 19.401485543915452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate taxonomic classification from DNA barcodes is a cornerstone of global biodiversity monitoring, yet fungi present extreme challenges due to sparse labelling and long-tailed taxa distributions. Conventional supervised learning methods often falter in this domain, struggling to generalize to unseen species and to capture the hierarchical nature of the data. To address these limitations, we introduce BarcodeMamba+, a foundation model for fungal barcode classification built on a powerful and efficient state-space model architecture. We employ a pretrain-and-fine-tune paradigm that utilizes partially labelled data, and we demonstrate that it is substantially more effective than traditional fully-supervised methods in this data-sparse environment. During fine-tuning, we systematically integrate and evaluate a suite of enhancements, including hierarchical label smoothing, a weighted loss function, and a multi-head output layer from MycoAI, to specifically tackle the challenges of fungal taxonomy. Our experiments show that each of these components yields significant performance gains. On a challenging fungal classification benchmark with distinct taxonomic distribution shifts from the broad training set, our final model outperforms a range of existing methods across all taxonomic levels. Our work provides a powerful new tool for genomics-based biodiversity research and establishes an effective and scalable training paradigm for this challenging domain. Our code is publicly available at https://github.com/bioscan-ml/BarcodeMamba.
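The hierarchical label smoothing mentioned in the abstract can be illustrated with a minimal sketch: rather than spreading the smoothing mass uniformly over all classes, labels sharing a parent taxon (e.g. the same genus) receive a larger share. The function name, weights, and species below are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch: hierarchy-aware label smoothing for taxonomic classification.
# The sibling_frac split and example taxa are illustrative, not from the paper.

def hierarchical_smooth(target, classes, parent_of, eps=0.1, sibling_frac=0.8):
    """Return a smoothed label distribution over `classes`.

    target       -- the true class label
    parent_of    -- dict mapping each class to its parent taxon
    eps          -- total smoothing mass taken from the true class
    sibling_frac -- fraction of eps reserved for classes with the same parent
    """
    siblings = [c for c in classes if c != target and parent_of[c] == parent_of[target]]
    others = [c for c in classes if c != target and c not in siblings]
    dist = {c: 0.0 for c in classes}
    dist[target] = 1.0 - eps
    for c in siblings:
        dist[c] = eps * sibling_frac / max(len(siblings), 1)
    for c in others:
        dist[c] = eps * (1 - sibling_frac) / max(len(others), 1)
    return dist

classes = ["amanita_muscaria", "amanita_phalloides", "boletus_edulis"]
parent_of = {"amanita_muscaria": "Amanita",
             "amanita_phalloides": "Amanita",
             "boletus_edulis": "Boletus"}
dist = hierarchical_smooth("amanita_muscaria", classes, parent_of)
# The sibling in the same genus receives more mass than the unrelated class.
assert dist["amanita_phalloides"] > dist["boletus_edulis"]
```

A mistake among siblings is thus penalized less than a mistake across genera, which matches the hierarchical structure of taxonomic labels.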
Related papers
- Beyond Softmax: A Natural Parameterization for Categorical Random Variables [61.709831225296305]
We introduce the $\textit{catnat}$ function, a function composed of a sequence of hierarchical binary splits. A rich set of experiments shows that the proposed function improves learning efficiency and yields models with consistently higher test performance.
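The idea of building a categorical distribution from a sequence of binary splits can be sketched for four classes: each split is a sigmoid gate, and a class probability is the product of the gates along its path through the tree. The tree layout and formula here are an illustrative assumption, not the paper's exact parameterization.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def catnat4(logits):
    """Toy categorical over 4 classes from hierarchical binary splits.

    logits = (root, left, right): the root gate splits {0,1} vs {2,3},
    and each child gate splits its pair.
    """
    r, l, q = (sigmoid(z) for z in logits)
    return [r * l, r * (1 - l), (1 - r) * q, (1 - r) * (1 - q)]

p = catnat4((0.0, 2.0, -2.0))
assert abs(sum(p) - 1.0) < 1e-9  # path products always sum to 1
```

Because every leaf probability is a product of sigmoids, the result is a valid distribution by construction, with no softmax normalization needed.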
arXiv Detail & Related papers (2025-09-29T12:55:50Z) - Hyperbolic Multimodal Representation Learning for Biological Taxonomies [23.639218053531962]
Taxonomic classification in biodiversity research involves organizing biological specimens into structured hierarchies based on evidence. We investigate whether hyperbolic networks can provide a better embedding space for such hierarchical models. Our method embeds multimodal inputs into a shared hyperbolic space using a contrastive objective and a novel stacked entailment-based objective.
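The appeal of hyperbolic space for hierarchies can be seen from the standard Poincaré ball distance: points near the boundary are exponentially far apart, giving the tree-like geometry that suits taxonomies. This snippet uses the textbook formula as an illustration; it is not the paper's code.

```python
import math

def poincare_dist(u, v):
    """Distance between two points in the Poincaré ball model (norms < 1)."""
    sq = lambda x: sum(t * t for t in x)
    diff = sq([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq(u)) * (1 - sq(v))))

# Two pairs separated by the same Euclidean gap relative to their norms:
near_origin = poincare_dist((0.1, 0.0), (-0.1, 0.0))
near_boundary = poincare_dist((0.95, 0.0), (-0.95, 0.0))
assert near_boundary > 10 * near_origin  # boundary points are far apart
```

This exponential growth of volume toward the boundary is what lets a hyperbolic embedding fit many leaf taxa under few ancestors with low distortion.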
arXiv Detail & Related papers (2025-08-22T18:52:50Z) - Bridging Classical and Modern Computer Vision: PerceptiveNet for Tree Crown Semantic Segmentation [0.0]
PerceptiveNet is a novel model incorporating a Logarithmic Gabor-parameterised convolutional layer with trainable filter parameters. We investigate the impact of Log-Gabor, Gabor, and standard convolutional layers on semantic segmentation performance. Our model outperforms state-of-the-art models, demonstrating significant performance improvements on a tree crown dataset.
arXiv Detail & Related papers (2025-05-29T16:11:08Z) - BarcodeMamba: State Space Models for Biodiversity Analysis [14.524535359259414]
BarcodeMamba is a performant and efficient foundation model for DNA barcodes in biodiversity analysis. Our study shows that BarcodeMamba outperforms BarcodeBERT even when using only 8.3% as many parameters. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest-neighbor (1-NN) probing for unseen species.
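The 1-NN probing protocol mentioned above can be sketched in a few lines: each query embedding is assigned the label of its closest reference embedding, so accuracy measures how well the frozen representation clusters taxa. The toy 2-D embeddings and genus names below are illustrative assumptions.

```python
# Hedged sketch of 1-nearest-neighbour probing on frozen embeddings.

def one_nn(query, refs):
    """refs: list of (embedding, label); return the label of the nearest one."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(refs, key=lambda r: dist(query, r[0]))[1]

refs = [((0.0, 0.0), "Amanita"), ((1.0, 1.0), "Boletus")]
assert one_nn((0.1, -0.1), refs) == "Amanita"
assert one_nn((0.9, 1.2), refs) == "Boletus"
```

Because no classifier is trained, 1-NN probing isolates the quality of the pretrained embedding itself, which is why it is a common test for unseen species.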
arXiv Detail & Related papers (2024-12-15T06:52:18Z) - A Closer Look at Deep Learning Methods on Tabular Datasets [78.61845513154502]
We present an extensive study on TALENT, a collection of 300+ datasets spanning a broad range of sizes. Our evaluation shows that ensembling benefits both tree-based and neural approaches.
arXiv Detail & Related papers (2024-07-01T04:24:07Z) - LayerMatch: Do Pseudo-labels Benefit All Layers? [77.59625180366115]
Semi-supervised learning offers a promising solution to mitigate the dependency on labeled data.
We develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering.
Our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks.
arXiv Detail & Related papers (2024-06-20T11:25:50Z) - BarcodeBERT: Transformers for Biodiversity Analysis [18.582770076266737]
We introduce BarcodeBERT, a family of models tailored to biodiversity analysis. BarcodeBERT is trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes.
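Barcode language models of this kind tokenize DNA sequences before masked pretraining; non-overlapping k-mer tokenization is one common choice. The value of k and the ragged-tail handling below are illustrative assumptions, not necessarily BarcodeBERT's settings.

```python
# Hedged sketch: non-overlapping k-mer tokenization of a DNA barcode.

def kmer_tokens(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mers, dropping a short tail."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

assert kmer_tokens("ACGTACGTAC") == ["ACGT", "ACGT"]  # 2-base tail dropped
```

Each k-mer then maps to an entry in a 4^k-word vocabulary (plus special tokens), analogous to subword tokens in text transformers.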
arXiv Detail & Related papers (2023-11-04T13:25:49Z) - A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
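The confidence-thresholded pseudo-labeling at the core of FixMatch, which StyleMatch builds on, fits in a few lines: a model's prediction on an unlabeled example is kept as a training target only when its confidence clears a threshold. The threshold value and toy probabilities are illustrative; the full method also pairs weak and strong augmentations, omitted here.

```python
# Hedged sketch of FixMatch-style confidence-thresholded pseudo-labelling.

def pseudo_label(probs, threshold=0.95):
    """Return the argmax class if max probability >= threshold, else None."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None

assert pseudo_label([0.97, 0.02, 0.01]) == 0      # confident: kept
assert pseudo_label([0.60, 0.30, 0.10]) is None   # uncertain: skipped
```

Skipping low-confidence examples keeps noisy pseudo-labels out of the loss, which is what makes the approach stable with very few true labels.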
arXiv Detail & Related papers (2021-06-01T16:00:08Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.