BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity
- URL: http://arxiv.org/abs/2406.17720v2
- Date: Mon, 27 Jan 2025 20:06:18 GMT
- Title: BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity
- Authors: Chih-Hsuan Yang, Benjamin Feuer, Zaki Jubery, Zi K. Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh K Singh, Arti Singh, Soumik Sarkar, Nirav Merchant, Chinmay Hegde, Baskar Ganapathysubramanian,
- Abstract summary: BioTrove is the largest dataset designed to advance AI applications in biodiversity.
It contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms.
Each image is annotated with scientific names, taxonomic hierarchies, and common names.
- Score: 14.949271003068107
- License:
- Abstract: We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
Related papers
- Artificial Immune System of Secure Face Recognition Against Adversarial Attacks [67.31542713498627]
optimisation is required for insect production to realise its full potential.
This can be by targeted improvement of traits of interest through selective breeding.
This review combines knowledge from diverse disciplines, bridging the gap between animal breeding, quantitative genetics, evolutionary biology, and entomology.
arXiv Detail & Related papers (2024-06-26T07:50:58Z) - OpenAnimalTracks: A Dataset for Animal Track Recognition [2.3020018305241337]
We introduce OpenAnimalTracks dataset, the first publicly available labeled dataset designed to facilitate the automated classification and detection of animal footprints.
We show the potential of automated footprint identification with representative classifiers and detection models.
We hope our dataset paves the way for automated animal tracking techniques, enhancing our ability to protect and manage biodiversity.
arXiv Detail & Related papers (2024-06-14T00:37:17Z) - BioCLIP: A Vision Foundation Model for the Tree of Life [34.187429586642146]
We release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images.
We then develop BioCLIP, a foundation model for the tree of life.
We rigorously benchmark our approach on diverse fine-grained biology classification tasks.
arXiv Detail & Related papers (2023-11-30T18:49:43Z) - SatBird: Bird Species Distribution Modeling with Remote Sensing and
Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z) - A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect
Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z) - Biomaker CA: a Biome Maker project using Cellular Automata [69.82087064086666]
We introduce Biomaker CA: a Biome Maker project using Cellular Automata (CA)
In Biomaker CA, morphogenesis is a first class citizen and small seeds need to grow into plant-like organisms to survive in a nutrient starved environment.
We show how this project allows for several different kinds of environments and laws of 'physics', alongside different model architectures and mutation strategies.
arXiv Detail & Related papers (2023-07-18T15:03:40Z) - BioGPT: Generative Pre-trained Transformer for Biomedical Text
Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z) - Ensembles of Vision Transformers as a New Paradigm for Automated
Classification in Ecology [0.0]
We show that ensembles of Data-efficient image Transformers (DeiTs) significantly outperform the previous state of the art (SOTA)
On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%.
arXiv Detail & Related papers (2022-03-03T14:16:22Z) - Florida Wildlife Camera Trap Dataset [48.99466876948454]
We introduce a challenging wildlife camera trap classification dataset collected from two different locations in Southwestern Florida.
The dataset consists of 104,495 images featuring visually similar species, varying illumination conditions, skewed class distribution, and including samples of endangered species.
arXiv Detail & Related papers (2021-06-23T18:53:15Z) - Automatic image-based identification and biomass estimation of
invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.