EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM
- URL: http://arxiv.org/abs/2506.01667v2
- Date: Sun, 28 Sep 2025 12:14:53 GMT
- Title: EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM
- Authors: Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota
- Abstract summary: Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
- Score: 103.7537991413311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (i.e., HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
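The abstract's central mechanism is adaptive, language-query-conditioned fusion of optical and SAR features. As a rough illustration of that idea only (the page includes no code, and this is not the authors' implementation), the sketch below assumes a two-level design: cross-sensor attention between optical and SAR tokens, followed by query attention over the gated, fused tokens. Module names, dimensions, and the gating choice are illustrative assumptions.
```python
# Minimal sketch of a hierarchical cross-modal attention (HCA) style fusion.
# NOTE: illustrative only -- the two-level design, module names, gating choice,
# and dimensions are assumptions inferred from the abstract, not the authors' code.
import torch
import torch.nn as nn


class HierarchicalCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Level 1: cross-sensor attention (optical <-> SAR patch tokens).
        self.opt_to_sar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sar_to_opt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Level 2: language-query attention over the fused visual tokens.
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Adaptive gate deciding how much SAR context to mix into each optical token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, opt_tokens, sar_tokens, text_query):
        # opt_tokens: (B, N_o, D), sar_tokens: (B, N_s, D), text_query: (B, N_t, D)
        opt_ctx, _ = self.opt_to_sar(opt_tokens, sar_tokens, sar_tokens)
        sar_ctx, _ = self.sar_to_opt(sar_tokens, opt_tokens, opt_tokens)
        g = self.gate(torch.cat([opt_tokens, opt_ctx], dim=-1))
        fused = torch.cat([opt_tokens + g * opt_ctx, sar_tokens + sar_ctx], dim=1)
        fused = self.norm(fused)
        # Align the fused visual tokens with the language query before the LLM.
        out, _ = self.query_attn(text_query, fused, fused)
        return out  # (B, N_t, D) language-aligned visual features


if __name__ == "__main__":
    fusion = HierarchicalCrossModalFusion()
    opt = torch.randn(2, 196, 768)   # optical patch tokens
    sar = torch.randn(2, 196, 768)   # SAR patch tokens
    txt = torch.randn(2, 32, 768)    # language-query embeddings
    print(fusion(opt, sar, txt).shape)  # torch.Size([2, 32, 768])
```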
Related papers
- Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents [49.3216026940601]
Earth observation is essential for understanding the states of the Earth system. Recent MLLMs have advanced EO research, but they still lack the capability to tackle complex tasks that require multi-step reasoning. We introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem.
arXiv Detail & Related papers (2025-09-27T06:04:28Z) - OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery [10.196580289786414]
Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. We introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.
arXiv Detail & Related papers (2025-09-23T06:23:56Z) - MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data [6.142054389646456]
We introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. We evaluate MAESTRO on four Earth observation datasets in both intra- and cross-dataset settings.
arXiv Detail & Related papers (2025-08-14T17:58:45Z) - TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
arXiv Detail & Related papers (2025-06-06T17:59:50Z) - OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data [42.73179312287478]
We introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres. It integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning, and chain-of-thought reasoning. Experiments reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy.
arXiv Detail & Related papers (2025-05-29T15:02:27Z) - Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models [70.41727912081463]
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images. We propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Our model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning.
arXiv Detail & Related papers (2025-05-22T17:59:39Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery [15.581788175591097]
It is challenging to adapt spatial models designed for natural images to remote sensing imagery. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Experiments demonstrate the superiority of the proposed EarthGPT-X on multi-grained tasks.
arXiv Detail & Related papers (2025-04-17T09:56:35Z) - TerraMind: Large-Scale Generative Multimodality for Earth Observation [3.5472166810202457]
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation. Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data.
arXiv Detail & Related papers (2025-04-15T13:17:39Z) - EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues [46.601134018876955]
We introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks. Our experimental results on 44 downstream datasets demonstrate that EarthDial outperforms existing generic and domain-specific models.
arXiv Detail & Related papers (2024-12-19T18:57:13Z) - FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [56.08867996209236]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources. We introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios. We develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
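For the MISA entry above, the core idea as described in its abstract, projecting each modality into a shared (modality-invariant) subspace and a private (modality-specific) subspace and training them with similarity and difference objectives, can be sketched as follows. The feature dimensions, encoders, and loss functions below are simplified placeholders, not the paper's exact design.
```python
# Minimal sketch of the MISA idea: project each modality into a shared
# (modality-invariant) subspace and a private (modality-specific) subspace,
# pull the shared representations together, and keep shared and private
# representations of the same modality decorrelated.
# NOTE: dimensions and losses are simplified placeholders, not the paper's
# exact encoders or objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MISAStyleEncoders(nn.Module):
    def __init__(self, dims=None, hidden: int = 128):
        super().__init__()
        dims = dims or {"text": 300, "audio": 74, "video": 47}  # illustrative feature sizes
        self.embed = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.shared = nn.Linear(hidden, hidden)                                      # invariant projector (weights shared)
        self.private = nn.ModuleDict({m: nn.Linear(hidden, hidden) for m in dims})   # specific projectors

    def forward(self, inputs):
        invariant, specific = {}, {}
        for m, x in inputs.items():
            h = torch.relu(self.embed[m](x))
            invariant[m] = self.shared(h)     # commonalities across modalities
            specific[m] = self.private[m](h)  # modality-unique factors
        return invariant, specific


def misa_style_losses(invariant, specific):
    mods = list(invariant)
    # Similarity: invariant representations of different modalities should agree.
    sim = sum(F.mse_loss(invariant[a], invariant[b])
              for i, a in enumerate(mods) for b in mods[i + 1:])
    # Difference: invariant and specific vectors of the same modality should be near-orthogonal.
    diff = sum((F.normalize(invariant[m], dim=-1) *
                F.normalize(specific[m], dim=-1)).sum(-1).pow(2).mean()
               for m in mods)
    return sim, diff


if __name__ == "__main__":
    model = MISAStyleEncoders()
    batch = {"text": torch.randn(8, 300), "audio": torch.randn(8, 74), "video": torch.randn(8, 47)}
    inv, spec = model(batch)
    print(misa_style_losses(inv, spec))
```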
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.