MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
- URL: http://arxiv.org/abs/2508.10894v2
- Date: Thu, 09 Oct 2025 14:49:28 GMT
- Title: MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
- Authors: Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier
- Abstract summary: We introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. We evaluate MAESTRO on four Earth observation datasets in both intra- and cross-dataset settings.
- Score: 6.142054389646456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and normalization schemes of reconstruction targets for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets in both intra- and cross-dataset settings, MAESTRO achieves state-of-the-art performance on tasks that strongly rely on multitemporal dynamics, while also remaining competitive on others. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
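The normalization scheme the abstract alludes to can be illustrated with a small sketch. This is not the authors' implementation: it assumes a MAE-style setup in which reconstruction targets are standardized per patch and per spectral band (in the spirit of MAE's norm-pix-loss), with the band dimension playing the role of the spectral prior. Function names, shapes, and the masking ratio are illustrative.

```python
import numpy as np

def normalize_targets(patches, eps=1e-6):
    """Standardize reconstruction targets per patch and per spectral band.

    patches: array of shape (num_patches, pixels_per_patch, num_bands).
    Each (patch, band) slice is shifted/scaled to zero mean and unit std,
    so the reconstruction loss measures relative spectral structure
    rather than absolute radiance.
    """
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches: True = hidden from the encoder."""
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[: int(num_patches * mask_ratio)]] = True
    return mask

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 64, 10))            # 16 patches, 10 bands
mask = random_mask(16, 0.75, rng)                  # hide 12 of 16 patches
targets = normalize_targets(patches)
pred = np.zeros_like(targets)                      # stand-in decoder output
loss = ((pred[mask] - targets[mask]) ** 2).mean()  # loss on masked patches only
```

Because the targets are standardized, a trivial constant prediction yields a loss near 1, which makes losses comparable across bands with very different radiance scales.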
Related papers
- Quantizing Space and Time: Fusing Time Series and Images for Earth Observation [4.012968772806928]
We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images. Our approach explores deterministic and learned strategies for time series quantization. Our model generates consistent global temperature profiles from satellite imagery.
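The deterministic quantization strategy mentioned in this summary can be sketched simply: bucket each time-series value into uniform bins and use the bin indices as discrete tokens. This is only an illustrative baseline under assumed details (uniform bins between the per-series min and max); the learned strategies the paper explores would replace the fixed bin edges.

```python
import numpy as np

def quantize_series(series, num_bins):
    """Deterministic quantization of a 1-D series into `num_bins` tokens.

    Bin edges are spaced uniformly between the series min and max;
    each value maps to the index of the bin that contains it.
    """
    lo, hi = float(series.min()), float(series.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    # digitize against the interior edges so tokens fall in [0, num_bins - 1]
    tokens = np.digitize(series, edges[1:-1])
    return tokens, edges

series = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
tokens, edges = quantize_series(series, num_bins=4)  # -> [0, 0, 2, 3, 3]
```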
arXiv Detail & Related papers (2025-10-27T08:38:52Z)
- UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment [22.92093036869778]
We present UNO, a unified visual odometry framework that enables robust pose estimation across diverse environments. Our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices. We extensively evaluate our method on three major benchmark datasets.
arXiv Detail & Related papers (2025-06-08T06:30:37Z)
- TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation [65.74990259650984]
We introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
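The dual-centering mechanism is specific to TerraFM; as a rough illustration of what centering does in self-distillation (in the style of DINO), teacher logits are shifted by a running center before a sharpened softmax, which discourages collapse onto a single prototype. The temperatures, momentum, and the "dual" variant itself are assumptions beyond this single-center sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def teacher_probs(logits, center, temp=0.07):
    """Center the teacher logits before a sharpened softmax;
    subtracting the running center discourages collapse."""
    return softmax((logits - center) / temp, axis=-1)

def update_center(center, logits, momentum=0.9):
    """EMA update of the center from the current batch of teacher logits."""
    return momentum * center + (1.0 - momentum) * logits.mean(axis=0)

rng = np.random.default_rng(1)
logits = rng.normal(size=(8, 16))       # batch of 8, 16 prototypes
center = np.zeros(16)
probs = teacher_probs(logits, center)   # targets for the student
center = update_center(center, logits)  # drift toward the batch mean
```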
arXiv Detail & Related papers (2025-06-06T17:59:50Z)
- EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM [103.7537991413311]
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z)
- PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data [0.2209921757303168]
We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualizing the attention scores, and the model's applicability to downstream tasks.
arXiv Detail & Related papers (2025-04-26T02:34:33Z)
- Whenever, Wherever: Towards Orchestrating Crowd Simulations with Spatio-Temporal Spawn Dynamics [65.72663487116439]
We propose nTPP-GMM, which models spatio-temporal spawn dynamics using Neural Temporal Point Processes. We evaluate our approach through simulations of three diverse real-world datasets with nTPP-GMM.
arXiv Detail & Related papers (2025-03-20T18:46:41Z)
- EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications in Earth monitoring tasks. The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1 m spatial resolution data from Satellogic. Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z)
- EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues [46.601134018876955]
We introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks. Our experimental results on 44 downstream datasets demonstrate that EarthDial outperforms existing generic and domain-specific models.
arXiv Detail & Related papers (2024-12-19T18:57:13Z)
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control [122.65089441381741]
We present GEM, a Generalizable Ego-vision Multimodal world model. It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Our dataset comprises 4000+ hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights.
arXiv Detail & Related papers (2024-12-15T14:21:19Z)
- SpectralEarth: Training Hyperspectral Foundation Models at Scale [47.93167977587301]
We introduce SpectralEarth, a large-scale multitemporal dataset designed to pretrain hyperspectral foundation models. We pretrain a series of foundation models on SpectralEarth, integrating a spectral adapter into classical vision backbones. In tandem, we construct nine downstream datasets for land-cover, crop-type mapping, and tree-species classification.
arXiv Detail & Related papers (2024-08-15T22:55:59Z)
- Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation [48.66623377464203]
Our novel approach introduces the Dynamic One-For-All (DOFA) model, leveraging the concept of neural plasticity in brain science.
This dynamic hypernetwork, adjusting to different wavelengths, enables a single versatile Transformer jointly trained on data from five sensors to excel across 12 distinct Earth observation tasks.
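The dynamic-hypernetwork idea can be sketched as a tiny network that maps each band's central wavelength to that band's embedding weights, so sensors with different band sets can share one backbone. This is a toy illustration, not DOFA's actual architecture; the layer sizes, activation, and interface below are assumptions.

```python
import numpy as np

class WavelengthHypernet:
    """Toy hypernetwork: maps band wavelengths (micrometres) to the
    per-band weights of a linear spectral-embedding layer, so one model
    can ingest sensors with different band sets."""

    def __init__(self, embed_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(1, hidden))
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, embed_dim))

    def band_weights(self, wavelengths):
        """Generate one embedding weight vector per band."""
        h = np.tanh(np.asarray(wavelengths, dtype=float)[:, None] @ self.w1)
        return h @ self.w2                         # (num_bands, embed_dim)

    def embed(self, pixels, wavelengths):
        """pixels: (num_tokens, num_bands) -> (num_tokens, embed_dim)."""
        return pixels @ self.band_weights(wavelengths)

net = WavelengthHypernet(embed_dim=8)
s2_bands = [0.490, 0.560, 0.665, 0.842]            # e.g. Sentinel-2 B/G/R/NIR
tokens = net.embed(np.ones((5, 4)), s2_bands)      # same net, any sensor
```

Because the weights are generated from wavelengths rather than stored per sensor, adding a new sensor only requires listing its band wavelengths.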
arXiv Detail & Related papers (2024-03-22T17:11:47Z)
- An Unsupervised Short- and Long-Term Mask Representation for Multivariate Time Series Anomaly Detection [2.387411589813086]
This paper proposes an anomaly detection method based on unsupervised Short- and Long-term Mask Representation learning (SLMR).
Experiments show that the performance of our method outperforms other state-of-the-art models on three real-world datasets.
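As a rough illustration of mask-based anomaly scoring (not SLMR's learned model), each point can be masked out, predicted from its context window, and scored by the reconstruction error; combining a short and a long window is the crudest analogue of the short/long-term split. All details below are assumptions for the sketch.

```python
import numpy as np

def masked_scores(series, window):
    """Mask each point, predict it as the mean of its context window,
    and return the absolute reconstruction error as an anomaly score."""
    n = len(series)
    scores = np.zeros(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        context = np.concatenate([series[lo:i], series[i + 1:hi]])
        scores[i] = abs(series[i] - context.mean())
    return scores

series = np.ones(20)
series[10] = 5.0                       # injected point anomaly
short = masked_scores(series, window=3)
long_ = masked_scores(series, window=8)
score = 0.5 * (short + long_)          # combine the two horizons
```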
arXiv Detail & Related papers (2022-08-19T09:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.