OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
- URL: http://arxiv.org/abs/2505.23522v2
- Date: Tue, 04 Nov 2025 12:55:32 GMT
- Title: OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
- Authors: Fengxiang Wang, Mingshuo Chen, Xuming He, Yueying Li, YiFan Zhang, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang, Wenlong Zhang, Lei Bai
- Abstract summary: Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions. We introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations.
- Score: 72.98496934729245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere or the atmosphere and to at most 16 tasks. These limitations include narrow-source heterogeneity (single or few data sources), constrained scientific granularity, and limited-sphere extensibility. We therefore introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres (atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere) as well as cross-sphere interactions. Built with a scalable, modular-topology data inference framework, native multi-observation sources, and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task) encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmark: none reaches 35% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code are released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
Related papers
- Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting [99.4484686548807]
We propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. Our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. Experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models.
arXiv Detail & Related papers (2026-01-29T15:58:07Z) - OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation [68.10925029626709]
OlmoEarth is a multimodal, sequential-temporal foundation model designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models.
arXiv Detail & Related papers (2025-11-17T18:06:26Z) - TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z) - EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models [96.18182289276649]
We present EarthMind, a novel vision-language framework for multi-granular and multi-sensor Earth Observation (EO) data understanding. EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space. To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z) - MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science [39.224598950224454]
We introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges.
arXiv Detail & Related papers (2025-05-27T05:30:35Z) - EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models [36.72915099998998]
We present a professional benchmark for the Earth sciences, designed to evaluate the capabilities of Large Language Models (LLMs) in scientific exploration. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron and Earth-Silver. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing knowledge crucial for scientific exploration.
arXiv Detail & Related papers (2025-05-22T06:46:08Z) - TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data [3.674991996196602]
We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, radar, elevation, and land-cover modalities in a single format. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset will be made publicly available with a permissive license.
arXiv Detail & Related papers (2025-04-15T13:20:35Z) - EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z) - PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic maps, as fundamental diagrams in geology, provide critical insights into the structure and composition of Earth's subsurface and surface. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z) - Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation [48.66623377464203]
Our novel approach introduces the Dynamic One-For-All (DOFA) model, leveraging the concept of neural plasticity in brain science.
This dynamic hypernetwork, adjusting to different wavelengths, enables a single versatile Transformer jointly trained on data from five sensors to excel across 12 distinct Earth observation tasks.
arXiv Detail & Related papers (2024-03-22T17:11:47Z) - Earthformer: Exploring Space-Time Transformers for Earth System
Forecasting [27.60569643222878]
We propose Earthformer, a space-time Transformer for Earth system forecasting.
The Transformer is based on a generic, flexible and efficient space-time attention block, named Cuboid Attention.
Experiments on two real-world benchmarks, precipitation nowcasting and El Nino/Southern Oscillation forecasting, show Earthformer achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-07-12T20:52:26Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.