BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
- URL: http://arxiv.org/abs/2406.12723v4
- Date: Wed, 13 Nov 2024 01:45:11 GMT
- Title: BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
- Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang
- Abstract summary: BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens.
We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy.
- Score: 19.003642885871546
- Abstract: As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and size information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task in which feature embeddings of images and DNA barcodes, obtained from self-supervised learning, are clustered to investigate whether meaningful clusters can be derived from these representations. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/bioscan-ml/BIOSCAN-5M.
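The first benchmark in the abstract pretrains a masked language model on DNA barcode sequences. The following is a minimal sketch of the input preparation such a model typically needs: splitting a barcode into k-mer tokens and masking a random subset for the model to reconstruct. The non-overlapping 4-mer tokenization, the 50% mask ratio, the `[MASK]` symbol, the function names, and the example sequence are all assumptions made for illustration; the listing above does not state the tokenizer or masking settings used by the authors.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def kmer_tokenize(seq, k=4):
    """Split a nucleotide string into non-overlapping k-mers.

    A trailing fragment shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    if not tokens:
        return [], {}
    # Mask at least one token so every sequence yields a training signal.
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets


if __name__ == "__main__":
    barcode = "AACATTATATTTTATTTTTGGAATTTGAGC"  # made-up sequence
    tokens = kmer_tokenize(barcode)
    masked, targets = mask_tokens(tokens)
    print(tokens)
    print(masked)
    print(targets)
```

During pretraining, the model would receive `masked` as input and be trained to predict the tokens stored in `targets`; the learned encoder could then be reused for the species- and genus-level classification described in the abstract.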
Related papers
- FungiTastic: A multi-modal dataset and benchmark for image categorization [21.01939456569417]
We introduce a new benchmark and a dataset, FungiTastic, based on fungal records continuously collected over a twenty-year span.
The dataset is labeled and curated by experts and consists of about 350k multimodal observations of 5k fine-grained categories (species).
FungiTastic is one of the few benchmarks that include a test set with DNA-sequenced ground truth of unprecedented label reliability.
arXiv Detail & Related papers (2024-08-24T17:22:46Z)
- CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale [21.995678534789615]
We use contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space.
Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks.
arXiv Detail & Related papers (2024-05-27T17:57:48Z)
- UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell).
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z)
- BarcodeBERT: Transformers for Biodiversity Analysis [19.082058886309028]
We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis.
BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks.
arXiv Detail & Related papers (2023-11-04T13:25:49Z)
- A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z)
- Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
- Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy [69.07918114341298]
Large-scale datasets play a vital role in computer vision.
Existing datasets are either collected according to label systems or blindly without differentiation to samples, making them inefficient and unscalable.
We advocate building a high-quality vision dataset actively annotated and continually on a comprehensive label system.
arXiv Detail & Related papers (2022-03-15T13:01:00Z)
- One Model is All You Need: Multi-Task Learning Enables Simultaneous Histology Image Segmentation and Classification [3.8725005247905386]
We present a multi-task learning approach for segmentation and classification of tissue regions.
We enable simultaneous prediction with a single network.
As a result of feature sharing, we also show that the learned representation can be used to improve downstream tasks.
arXiv Detail & Related papers (2022-02-28T20:22:39Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
- Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art ResNet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.