A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
- URL: http://arxiv.org/abs/2507.06972v1
- Date: Wed, 09 Jul 2025 16:03:06 GMT
- Title: A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
- Authors: Johanna Orsholm, John Quinto, Hannu Autto, Gaia Banelyte, Nicolas Chazot, Jeremy deWaard, Stephanie deWaard, Arielle Farrell, Brendan Furneaux, Bess Hardwick, Nao Ito, Amlan Kar, Oula Kalttopää, Deirdre Kerdraon, Erik Kristensen, Jaclyn McKeown, Tommi Mononen, Ellen Nein, Hanna Rogers, Tomas Roslin, Paula Schmitz, Jayme Sones, Maija Sujala, Amy Thompson, Evgeny V. Zakharov, Iuliia Zarubiieva, Akshita Gupta, Scott C. Lowe, Graham W. Taylor,
- Abstract summary: We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens.
- Score: 12.817729932901779
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High-throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high-resolution imaging showing strong potential for automatic taxonomic classification. However, most image-based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large-scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large-scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.
Related papers
- BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning [51.341003735575335]
We find emergent behaviors in biological vision models via large-scale contrastive vision-language training. We train BioCLIP 2 on TreeOfLife-200M to distinguish different species. We identify emergent properties in the learned embedding space of BioCLIP 2.
arXiv Detail & Related papers (2025-05-29T17:48:20Z)
- CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity [3.73232466691291]
We present CrypticBio, the largest publicly available dataset of visually confusing species. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species.
arXiv Detail & Related papers (2025-05-16T14:35:56Z)
- BeetleVerse: A study on taxonomic classification of ground beetles [0.310688583550805]
Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets.
arXiv Detail & Related papers (2025-04-18T01:06:37Z)
- A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z)
- Wild Face Anti-Spoofing Challenge 2023: Benchmark and Results [73.98594459933008]
Face anti-spoofing (FAS) is an essential mechanism for safeguarding the integrity of automated face recognition systems.
This limitation can be attributed to the scarcity and lack of diversity in publicly available FAS datasets.
We introduce the Wild Face Anti-Spoofing dataset, a large-scale, diverse FAS dataset collected in unconstrained settings.
arXiv Detail & Related papers (2023-04-12T10:29:42Z)
- Dynamic $\beta$-VAEs for quantifying biodiversity by clustering optically recorded insect signals [0.6091702876917281]
We propose an adaptive variant of the variational autoencoder (VAE) capable of clustering data by phylogenetic groups.
We demonstrate the usefulness of the dynamic $\beta$-VAE on optically recorded insect signals from regions of southern Scandinavia.
arXiv Detail & Related papers (2021-02-10T16:14:13Z)
- Deep Low-Shot Learning for Biological Image Classification and Visualization from Limited Training Samples [52.549928980694695]
In situ hybridization (ISH) gene expression pattern images from the same developmental stage are compared.
Labeling training data with precise stages is very time-consuming even for biologists.
We propose a deep two-step low-shot learning framework to accurately classify ISH images using limited training images.
arXiv Detail & Related papers (2020-10-20T06:06:06Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z)
- Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art ResNet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.