A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements
- URL: http://arxiv.org/abs/2601.10687v1
- Date: Wed, 14 Jan 2026 18:44:54 GMT
- Title: A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements
- Authors: S M Rayeed, Mridul Khurana, Alyson East, Isadora E. Fluck, Elizabeth G. Campolongo, Samuel Stevens, Iuliia Zarubiieva, Scott C. Lowe, Michael W. Denslow, Evan D. Donoso, Jiaman Wu, Michelle Ramirez, Benjamin Baiser, Charles V. Stewart, Paula Mabee, Tanya Berger-Wolf, Anuj Karpatne, Hilmar Lapp, Robert P. Guralnick, Graham W. Taylor, Sydne Record
- Abstract summary: Ground beetles serve as critical bioindicators of ecosystem health. National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the U.S. We present a dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI.
- Score: 13.860603856120795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.
Related papers
- CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis [51.56484100374058]
We introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
arXiv Detail & Related papers (2025-09-02T03:30:07Z) - Automated Detection of Antarctic Benthic Organisms in High-Resolution In Situ Imagery to Aid Biodiversity Monitoring [0.0]
We present a tailored object detection framework for Antarctic benthic organisms in high-resolution towed camera imagery. We show strong performance in detecting medium and large organisms across 25 fine-grained morphotypes. Our framework provides a scalable foundation for future machine-assisted in situ benthic biodiversity monitoring research.
arXiv Detail & Related papers (2025-07-29T10:22:29Z) - A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level [12.817729932901779]
We present the Mixed Arthropod Sample and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI-assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens.
arXiv Detail & Related papers (2025-07-09T16:03:06Z) - GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction [13.321623196078276]
We present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction.
arXiv Detail & Related papers (2025-07-09T12:51:46Z) - BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning [60.80381372245902]
We find emergent behaviors in biological vision models via large-scale contrastive vision-language training. We train BioCLIP 2 on TreeOfLife-200M to distinguish different species. We identify emergent properties in the learned embedding space of BioCLIP 2.
arXiv Detail & Related papers (2025-05-29T17:48:20Z) - BeetleVerse: A Study on Taxonomic Classification of Ground Beetles [0.310688583550805]
Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets. Our results show that the Vision and Language Transformer combined with a head is the best performing model, with 97% accuracy at genus and species level.
arXiv Detail & Related papers (2025-04-18T01:06:37Z) - SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z) - Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z) - Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology [0.0]
We show that ensembles of Data-efficient image Transformers (DeiTs) significantly outperform the previous state of the art (SOTA).
On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%.
arXiv Detail & Related papers (2022-03-03T14:16:22Z) - Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently.
Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces.
The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
arXiv Detail & Related papers (2021-03-18T11:01:21Z) - Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art Resnet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)