GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification
- URL: http://arxiv.org/abs/2505.12513v2
- Date: Sun, 25 May 2025 14:29:15 GMT
- Title: GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification
- Authors: Yang Mu, Zhitong Xiong, Yi Wang, Muhammad Shahzad, Franz Essl, Mark van Kleunen, Xiao Xiang Zhu
- Abstract summary: We introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species. We aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.
- Score: 21.705561682467152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across the hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero- and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.
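The zero-shot setting mentioned in the abstract can be illustrated with a minimal sketch of CLIP-style classification: an image embedding is compared against text-label embeddings by cosine similarity, and the closest label wins. This is not the GeoTreeCLIP implementation. The function names, the "family genus species" label format, and the three-dimensional embeddings below are all hypothetical, made-up values chosen only to show the mechanism.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, label_embeddings):
    """Return (best_label, scores), scoring each label by cosine similarity.

    image_embedding: list of floats (stand-in for an image encoder output).
    label_embeddings: dict mapping label text to a list of floats
                      (stand-in for a text encoder output).
    """
    img = l2_normalize(image_embedding)
    scores = {}
    for label, emb in label_embeddings.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings; real encoders would produce these vectors.
label_embeddings = {
    "Fagaceae Quercus Quercus robur": [0.9, 0.1, 0.0],
    "Pinaceae Pinus Pinus sylvestris": [0.1, 0.9, 0.1],
    "Betulaceae Betula Betula pendula": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Because no classifier head is trained, new species can in principle be scored by adding another entry to `label_embeddings`, which is what makes the approach zero-shot.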
Related papers
- HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation [1.0408909053766147]
We introduce a globally distributed benchmark dataset for forest aboveground biomass (AGB) estimation.
This benchmark dataset combines co-located hyperspectral imagery (HSI) from the Environmental Mapping and Analysis Program (EnMAP) satellite and predictions of AGB density estimates.
Our experimental results on this dataset demonstrate that the evaluated Geo-FMs can match or, in some cases, surpass the performance of a baseline U-Net.
Date: 2025-06-12T21:29:20Z
- The iNaturalist Sounds Dataset [60.157076990024606]
iNatSounds is a collection of 230,000 audio files capturing sounds from over 5,500 species, contributed by more than 27,000 recordists worldwide.
The dataset encompasses sounds from birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist.
We envision models trained on this data powering next-generation public engagement applications, and assisting biologists, ecologists, and land use managers in processing large audio collections.
Date: 2025-05-31T02:07:37Z
- BioCube: A Multimodal Dataset for Biodiversity Research [0.6749750044497732]
We introduce BioCube, a fine-grained global dataset for ecology and biodiversity research.
BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest, land indicators, and high-resolution climate variables.
All observations are geospatially aligned under the WGS84 geodetic system, spanning from 2000 to 2020.
Date: 2025-05-16T09:46:08Z
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
Date: 2025-02-19T14:21:25Z
- AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities [5.767156832161819]
We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders.
To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets with varying characteristics.
We then train a single powerful model on these diverse datasets simultaneously.
Date: 2024-12-18T18:11:53Z
- AGBD: A Global-scale Biomass Dataset [18.976975819550173]
Existing datasets for Above Ground Biomass estimation from satellite imagery are limited.
This dataset combines AGB reference data from the GEDI mission with data from Sentinel-2 and PALSAR-2 imagery.
It includes pre-processed high-level features such as a dense canopy height map, an elevation map, and a land-cover classification map.
It can be easily accessed using a single line of code, offering a solid basis for efforts towards global AGB estimation.
Date: 2024-06-07T13:34:17Z
- Planted: a dataset for planted forest identification from multi-satellite time series [23.822292894884427]
We present a dataset consisting of data from five public satellites for recognizing forest plantations and planted tree species across the globe.
The dataset, named PlantD, includes over 2M examples of 64 tree label classes (46 genera and 40 species), distributed among 41 countries.
Date: 2024-05-24T15:49:00Z
- SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
Date: 2023-11-02T02:00:27Z
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species.
Date: 2023-06-05T03:36:01Z
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
Date: 2023-05-24T11:05:12Z
- JSRT: James-Stein Regression Tree [55.2059664267247]
Regression trees (RTs) have been widely used in the machine learning and data mining communities.
In practice, the performance of RT relies heavily on the local mean of samples from an individual node during the tree construction/prediction stage.
We propose a novel regression tree, named James-Stein Regression Tree (JSRT), that considers global information from different nodes.
Date: 2020-10-18T16:28:49Z
This list is automatically generated from the titles and abstracts of the papers in this site.