TerraMind: Large-Scale Generative Multimodality for Earth Observation
- URL: http://arxiv.org/abs/2504.11171v1
- Date: Tue, 15 Apr 2025 13:17:39 GMT
- Title: TerraMind: Large-Scale Generative Multimodality for Earth Observation
- Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
- Abstract summary: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation. Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data.
- Score: 3.5472166810202457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
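To make the "Thinking-in-Modalities" idea concrete, below is a minimal, illustrative sketch of TiM-style inference: the model first generates tokens for an artificial intermediate modality (here, land cover) and then conditions its final prediction on both the real input and the generated modality. All module names, sizes, and the single-token land-cover step are assumptions for illustration only and do not reflect TerraMind's actual architecture or code.

```python
# Hypothetical TiM-style inference sketch (not TerraMind's implementation).
import torch
import torch.nn as nn

VOCAB = 1024  # assumed size of each modality's token vocabulary
DIM = 256     # assumed embedding dimension

class ToyAnyToAny(nn.Module):
    """A toy any-to-any token model over two geospatial modalities."""
    def __init__(self):
        super().__init__()
        self.embed = nn.ModuleDict({
            "s2_optical": nn.Embedding(VOCAB, DIM),  # e.g. Sentinel-2 tokens
            "landcover": nn.Embedding(VOCAB, DIM),   # e.g. land-cover tokens
        })
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleDict({
            "landcover": nn.Linear(DIM, VOCAB),  # generates land-cover tokens
            "task": nn.Linear(DIM, 10),          # downstream task logits
        })

    def forward(self, tokens_by_modality, head):
        # Early fusion: concatenate token embeddings from all given modalities.
        parts = [self.embed[m](t) for m, t in tokens_by_modality.items()]
        fused = self.backbone(torch.cat(parts, dim=1))
        return self.heads[head](fused.mean(dim=1))

def tim_inference(model, s2_tokens):
    """TiM-style inference: 'think' in an extra modality, then predict."""
    with torch.no_grad():
        # Step 1: generate artificial land-cover tokens from the optical input.
        lc_logits = model({"s2_optical": s2_tokens}, head="landcover")
        lc_tokens = lc_logits.argmax(dim=-1, keepdim=True)
        # Step 2: condition the final prediction on real + generated modalities.
        return model({"s2_optical": s2_tokens, "landcover": lc_tokens}, head="task")

if __name__ == "__main__":
    model = ToyAnyToAny()
    s2 = torch.randint(0, VOCAB, (2, 16))  # batch of 2 samples, 16 tokens each
    print(tim_inference(model, s2).shape)  # -> torch.Size([2, 10])
```

The design point the sketch tries to capture is that the intermediate modality is produced by the model itself at inference time, so no additional labels or sensors are required for the second pass.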
Related papers
- TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data [3.674991996196602]
We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, radar, elevation, and land-cover modalities in a single format.
We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh.
The dataset will be made publicly available with a permissive license.
arXiv Detail & Related papers (2025-04-15T13:20:35Z) - MESA: Text-Driven Terrain Generation Using Latent Diffusion and Global Copernicus Data [0.0]
We present MESA - a novel data-centric alternative to procedural terrain modeling.
MESA generates high-quality terrain samples from text descriptions using global remote sensing data.
The model's capabilities are demonstrated through extensive experiments, highlighting its ability to generate realistic and diverse terrain landscapes.
arXiv Detail & Related papers (2025-04-09T18:37:24Z) - OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence.
We propose an MLLM (OmniGeo) tailored to geospatial applications.
By combining the strengths of natural language understanding and spatial reasoning, our model improves instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z) - Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control [97.98560001760126]
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs.
We conduct evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment.
arXiv Detail & Related papers (2025-03-18T17:57:54Z) - GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models [27.878058177228727]
GeoLangBind is a novel agglomerative vision-language foundation model.
It bridges the gap between heterogeneous EO data modalities using language as a unifying medium.
Our approach aligns different EO data types into a shared language embedding space.
arXiv Detail & Related papers (2025-03-08T19:10:04Z) - EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks.
The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic.
Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - WorldGPT: Empowering LLM as Multimodal World Model [51.243464216500975]
We introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM).
WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains.
We conduct evaluations on WorldNet, a multimodal state transition prediction benchmark.
arXiv Detail & Related papers (2024-04-28T14:42:02Z) - OmniSat: Self-Supervised Modality Fusion for Earth Observation [5.767156832161819]
We introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels.
As demonstrated for three downstream tasks, OmniSat can learn rich representations without supervision, leading to state-of-the-art performances.
Our multimodal pretraining scheme improves performance even when only one modality is available for inference.
arXiv Detail & Related papers (2024-04-12T09:31:55Z) - T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities [69.16656086708291]
Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces.
We propose a new model comprising a view-wise sampling algorithm to focus on local structure learning.
The model can be scaled to generate high-resolution data while unifying multiple modalities.
arXiv Detail & Related papers (2023-05-24T03:32:03Z) - Earthformer: Exploring Space-Time Transformers for Earth System Forecasting [27.60569643222878]
We propose Earthformer, a space-time Transformer for Earth system forecasting.
The Transformer is based on a generic, flexible and efficient space-time attention block, named Cuboid Attention.
Experiments on two real-world benchmarks, precipitation nowcasting and El Niño/Southern Oscillation (ENSO) forecasting, show that Earthformer achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-07-12T20:52:26Z)