A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect
Dataset
- URL: http://arxiv.org/abs/2307.10455v3
- Date: Mon, 13 Nov 2023 18:10:23 GMT
- Title: A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect
Dataset
- Authors: Zahra Gharaee, ZeMing Gong, Nicholas Pellegrino, Iuliia Zarubiieva,
Joakim Bruslund Haurum, Scott C. Lowe, Jaclyn T.A. McKeown, Chris C.Y. Ho,
Joschka McLeod, Yi-Yun C Wei, Jireh Agda, Sujeevan Ratnasingham, Dirk
Steinke, Angel X. Chang, Graham W. Taylor, Paul Fieguth
- Abstract summary: This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
- Score: 18.211840156134784
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In an effort to catalog insect biodiversity, we propose a new large dataset
of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is
taxonomically classified by an expert, and also has associated genetic
information including raw nucleotide barcode sequences and assigned barcode
index numbers, which are genetically-based proxies for species classification.
This paper presents a curated million-image dataset, primarily to train
computer-vision models capable of providing image-based taxonomic assessment,
however, the dataset also presents compelling characteristics, the study of
which would be of interest to the broader machine learning community. Driven by
the biological nature inherent to the dataset, a characteristic long-tailed
class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is
a hierarchical classification scheme, presenting a highly fine-grained
classification problem at lower levels. Beyond spurring interest in
biodiversity research within the machine learning community, progress on
creating an image-based taxonomic classifier will also further the ultimate
goal of all BIOSCAN research: to lay the foundation for a comprehensive survey
of global biodiversity. This paper introduces the dataset and explores the
classification task through the implementation and analysis of a baseline
classifier.
Related papers
- BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity [19.003642885871546]
BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens.
We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy.
arXiv Detail & Related papers (2024-06-18T15:45:21Z) - Insect Identification in the Wild: The AMI Dataset [35.41544843896443]
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing.
Despite this crisis, data on insect diversity and abundance remain woefully inadequate.
We provide the first large-scale machine learning benchmarks for fine-grained insect recognition.
arXiv Detail & Related papers (2024-06-18T09:57:02Z) - UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell)
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z) - TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled
Zero-shot Genome Classification [0.0]
A species' genetic code or genome encodes valuable evolutionary, biological, and phylogenetic information.
Traditional bioinformatics tools have made notable progress but lack scalability and are computationally expensive.
We propose addressing this problem through zero-shot learning using TEPI, taxonomy-aware Embedding and Pseudo-Imaging.
arXiv Detail & Related papers (2024-01-24T04:16:28Z) - BI-LAVA: Biocuration with Hierarchical Image Labeling through Active
Learning and Visual Analysis [2.859324824091085]
BI-LAVA is a system for organizing scientific images in hierarchical structures.
It uses a small set of image labels, a hierarchical set of image classifiers, and active learning to help model builders deal with incomplete ground-truth labels.
An evaluation shows that our mixed human-machine approach successfully supports domain experts in understanding the characteristics of classes within the taxonomy.
arXiv Detail & Related papers (2023-08-15T19:36:19Z) - Information Gain Sampling for Active Learning in Medical Image
Classification [3.1619162190378787]
This work presents an information-theoretic active learning framework that guides the optimal selection of images from the unlabelled pool to be labeled.
Experiments are performed on two different medical image classification datasets.
arXiv Detail & Related papers (2022-08-01T16:25:53Z) - Structure-Enhanced Meta-Learning For Few-Shot Graph Classification [53.54066611743269]
This work explores the potential of metric-based meta-learning for solving few-shot graph classification.
An implementation upon GIN, named SMFGIN, is tested on two datasets, Chembl and TRIANGLES.
arXiv Detail & Related papers (2021-03-05T09:03:03Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for
Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Deep Low-Shot Learning for Biological Image Classification and
Visualization from Limited Training Samples [52.549928980694695]
In situ hybridization (ISH) gene expression pattern images from the same developmental stage are compared.
labeling training data with precise stages is very time-consuming even for biologists.
We propose a deep two-step low-shot learning framework to accurately classify ISH images using limited training images.
arXiv Detail & Related papers (2020-10-20T06:06:06Z) - Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z) - Automatic image-based identification and biomass estimation of
invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.