The Herbarium 2021 Half-Earth Challenge Dataset
- URL: http://arxiv.org/abs/2105.13808v1
- Date: Fri, 28 May 2021 13:24:12 GMT
- Title: The Herbarium 2021 Half-Earth Challenge Dataset
- Authors: Riccardo de Lutio, Damon Little, Barbara Ambrose, Serge Belongie
- Abstract summary: Herbarium sheets present a unique view of the world's botanical history, evolution, and diversity.
With the increased digitisation of herbaria worldwide and the advances in the fine-grained classification domain, there are a lot of opportunities for supporting research in this field.
Existing datasets are either too small, or not diverse enough, in terms of represented taxa, geographic distribution or host institutions.
We present the Herbarium Half-Earth dataset, the largest and most diverse dataset of herbarium specimens to date for automatic taxon recognition.
- Score: 1.1470070927586016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Herbarium sheets present a unique view of the world's botanical history,
evolution, and diversity. This makes them an all-important data source for
botanical research. With the increased digitisation of herbaria worldwide and
the advances in the fine-grained classification domain that can facilitate
automatic identification of herbarium specimens, there are a lot of
opportunities for supporting research in this field. However, existing datasets
are either too small, or not diverse enough, in terms of represented taxa,
geographic distribution or host institutions. Furthermore, aggregating multiple
datasets is difficult as taxa exist under a multitude of different names and
the taxonomy requires alignment to a common reference. We present the Herbarium
Half-Earth dataset, the largest and most diverse dataset of herbarium specimens
to date for automatic taxon recognition.
Related papers
- Towards Ancient Plant Seed Classification: A Benchmark Dataset and Baseline Model [62.98256440452042]
We construct the first Ancient Plant Seed Image Classification dataset.
It contains 8,340 images from 17 genus- or species-level seed categories excavated from 18 archaeological sites across China.
In both quantitative and qualitative analyses, our approach surpasses existing state-of-the-art image classification methods, achieving an accuracy of 90.5%.
arXiv Detail & Related papers (2025-12-20T07:18:22Z) - Overview of LifeCLEF Plant Identification task 2020 [2.961584451143903]
The LifeCLEF 2020 Plant Identification challenge (or "PlantCLEF 2020") was designed to evaluate to what extent automated identification of the flora of data-deficient regions can be improved by the use of herbarium collections.
It is based on a dataset of about 1,000 species mainly focused on South America's Guiana Shield, an area known to have one of the greatest diversities of plants in the world.
The challenge was evaluated as a cross-domain classification task where the training set consists of several hundred thousand herbarium sheets and a few thousand photos to enable learning a mapping between the two domains.
arXiv Detail & Related papers (2025-09-23T06:35:19Z) - Overview of PlantCLEF 2021: cross-domain plant identification [2.961584451143903]
The LifeCLEF 2021 plant identification challenge was designed to assess the extent to which automated identification of flora can be improved by using herbarium collections.
It is based on a dataset of about 1,000 species mainly focused on the Guiana Shield of South America.
The challenge was evaluated as a cross-domain classification task where the training set consisted of several hundred thousand herbarium sheets and a few thousand photos.
arXiv Detail & Related papers (2025-09-23T06:26:24Z) - A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level [12.817729932901779]
We present the Mixed Arthropod Sample and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples.
It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens.
Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens.
arXiv Detail & Related papers (2025-07-09T16:03:06Z) - CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity [3.73232466691291]
We present CrypticBio, the largest publicly available dataset of visually confusing species.
Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species.
arXiv Detail & Related papers (2025-05-16T14:35:56Z) - BioCube: A Multimodal Dataset for Biodiversity Research [0.6749750044497732]
We introduce BioCube, a fine-grained global dataset for ecology and biodiversity research.
BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest, land indicators, and high-resolution climate variables.
All observations are geospatially aligned under the WGS84 geodetic system, spanning from 2000 to 2020.
arXiv Detail & Related papers (2025-05-16T09:46:08Z) - iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species [0.8795327496993479]
We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species.
iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments.
By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems.
arXiv Detail & Related papers (2025-03-25T21:04:42Z) - Few-shot Species Range Estimation [61.60698161072356]
Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts.
We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data.
During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in a feed-forward manner.
arXiv Detail & Related papers (2025-02-20T19:13:29Z) - Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity [14.949271003068107]
This dataset includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude.
The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia).
arXiv Detail & Related papers (2024-06-25T17:09:54Z) - SatBird: Bird Species Distribution Modeling with Remote Sensing and
Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z) - A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect
Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z) - Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z) - CWD30: A Comprehensive and Holistic Dataset for Crop Weed Recognition in
Precision Agriculture [1.64709990449384]
We present the CWD30 dataset, a large-scale, diverse, holistic, and hierarchical dataset tailored for crop-weed recognition tasks in precision agriculture.
CWD30 comprises over 219,770 high-resolution images of 20 weed species and 10 crop species, encompassing various growth stages, multiple viewing angles, and environmental conditions.
The dataset's hierarchical taxonomy enables fine-grained classification and facilitates the development of more accurate, robust, and generalizable deep learning models.
arXiv Detail & Related papers (2023-05-17T09:39:01Z) - Multi-resolution Outlier Pooling for Sorghum Classification [4.434302808728865]
We introduce the Sorghum-100 dataset, a large dataset of RGB imagery of sorghum captured by a state-of-the-art gantry system.
A new global pooling strategy called Dynamic Outlier Pooling outperforms standard global pooling strategies on this task.
arXiv Detail & Related papers (2021-06-10T13:57:33Z) - Geo-Spatiotemporal Features and Shape-Based Prior Knowledge for Fine-grained Imbalanced Data Classification [63.916371837696396]
Fine-grained classification aims at distinguishing between items with similar global perception and patterns, but that differ by minute details.
Our primary challenges come from both small inter-class variations and large intra-class variations.
We propose to combine several innovations to improve fine-grained classification within the use-case of wildlife.
arXiv Detail & Related papers (2021-03-21T02:01:38Z) - Pollen13K: A Large Scale Microscope Pollen Grain Image Dataset [63.05335933454068]
This work presents the first large-scale pollen grain image dataset, including more than 13 thousand objects.
The paper focuses on the employed data acquisition steps, which include aerobiological sampling, microscope image acquisition, object detection, segmentation and labelling.
arXiv Detail & Related papers (2020-07-09T10:33:31Z) - Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z) - Scalable learning for bridging the species gap in image-based plant phenotyping [2.208242292882514]
The traditional paradigm of applying deep learning -- collect, annotate and train on data -- is not applicable to image-based plant phenotyping.
Data costs include growing physical samples, imaging and labelling them.
Model performance is impacted by the species gap between the domain of each plant species.
arXiv Detail & Related papers (2020-03-24T10:26:40Z) - Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.