GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI
- URL: http://arxiv.org/abs/2511.15658v1
- Date: Wed, 19 Nov 2025 17:45:02 GMT
- Title: GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI
- Authors: Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe-Moreno, Alexander Lacoste
- Abstract summary: GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
- Score: 52.13138825802668
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
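The "capability" grouping described in the abstract amounts to aggregating per-dataset results over datasets that share a characteristic (e.g., high-resolution RGB vs. multispectral) and ranking models within each group rather than by a single global average. The sketch below is a minimal illustration of that idea, not the official GEO-Bench-2 code; the model names, dataset names, capability tags, and scores are invented placeholders.

```python
# Minimal sketch (assumed workflow, not the GEO-Bench-2 implementation):
# aggregate hypothetical per-dataset scores into capability-group rankings.
import pandas as pd

# Placeholder per-dataset scores for two models (higher is better).
scores = pd.DataFrame(
    {
        "model": ["ConvNext-IN"] * 3 + ["TerraMind"] * 3,
        "dataset": ["hires_buildings", "hires_roads", "ms_crops"] * 2,
        "score": [0.81, 0.78, 0.62, 0.70, 0.72, 0.74],
    }
)

# Placeholder mapping from dataset to a shared-characteristic capability group.
capability = {
    "hires_buildings": "high_resolution",
    "hires_roads": "high_resolution",
    "ms_crops": "multispectral",
}
scores["capability"] = scores["dataset"].map(capability)

# Average each model's score within a capability group, then rank per group.
per_group = scores.groupby(["capability", "model"], as_index=False)["score"].mean()
per_group["rank"] = per_group.groupby("capability")["score"].rank(ascending=False)

print(per_group.sort_values(["capability", "rank"]))
```

In the actual benchmark the datasets, capability groups, metrics, and normalization are fixed by the released protocol; the point of the sketch is only that per-capability rankings, rather than one overall score, drive the comparison.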
Related papers
- OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models [57.94189874119267]
Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems. Current graph learning-based design methodologies often adhere to a "one-for-one" paradigm. We propose OFA-TAD, a one-for-all framework that generates adaptive collaboration graphs for any task described in natural language.
arXiv Detail & Related papers (2026-01-19T12:23:44Z)
- InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment [3.6927415209865533]
InstaGeo is an open-source framework for transforming raw satellite imagery into model-ready datasets. We show how InstaGeo can transform raw imagery into model-ready datasets and derive compact, compute-efficient models. We also show how InstaGeo can transform research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation.
arXiv Detail & Related papers (2025-10-07T06:57:15Z)
- Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
- Deploying Geospatial Foundation Models in the Real World: Lessons from WorldCereal [25.756741188074862]
This paper presents a structured approach to integrating geospatial foundation models into operational mapping systems. Our protocol has three key steps: defining application requirements, adapting the model to domain-specific data, and conducting rigorous empirical testing. Results highlight the model's strong spatial and temporal generalization capabilities.
arXiv Detail & Related papers (2025-07-16T15:10:32Z)
- HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation [1.0408909053766147]
We introduce a globally distributed benchmark dataset for forest aboveground biomass (AGB) estimation. The benchmark combines co-located hyperspectral imagery (HSI) from the Environmental Mapping and Analysis Program (EnMAP) satellite with corresponding AGB density estimates. Our experimental results on this dataset demonstrate that the evaluated Geo-FMs can match or, in some cases, surpass the performance of a baseline U-Net.
arXiv Detail & Related papers (2025-06-12T21:29:20Z)
- PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models [3.9031647202359667]
PANGAEA is a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. We evaluate the most popular GFMs openly available on this benchmark and analyze their performance across several domains. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models.
arXiv Detail & Related papers (2024-12-05T14:40:41Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent that generalizes zero-shot across different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- OpenGDA: Graph Domain Adaptation Benchmark for Cross-network Learning [42.48479966907126]
OpenGDA is a benchmark for evaluating graph domain adaptation models.
It provides abundant pre-processed and unified datasets for different types of tasks.
It integrates state-of-the-art models with standardized and end-to-end pipelines.
arXiv Detail & Related papers (2023-07-21T04:11:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.