Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
- URL: http://arxiv.org/abs/2511.11706v3
- Date: Thu, 20 Nov 2025 17:16:38 GMT
- Title: Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
- Authors: Julia Peters, Karin Mora, Miguel D. Mahecha, Chaonan Ji, David Montero, Clemens Mosig, Guido Kraemer,
- Abstract summary: We propose a representation learning framework that integrates different modalities into a unified space at high temporal resolution. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 data. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time.
- Score: 3.3984815208531014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
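The two-stage design described in the abstract (sensor-specific encoders first, then a shared fusion model in which only the fusion layers are retrained) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the class names `FrozenEncoder` and `FusionLayer`, the input and embedding dimensions, the random data, and the mean-squared-error objective against a synthetic placeholder target are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


class FrozenEncoder:
    """Hypothetical stand-in for a pretrained, sensor-specific encoder.

    Its weights are fixed after construction and never updated.
    """

    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.normal(scale=0.1, size=(in_dim, out_dim))

    def __call__(self, x):
        return np.tanh(x @ self.W)


class FusionLayer:
    """Hypothetical trainable linear layer over concatenated embeddings."""

    def __init__(self, in_dim, out_dim, rng):
        self.W = rng.normal(scale=0.1, size=(in_dim, out_dim))

    def __call__(self, z):
        return z @ self.W

    def sgd_step(self, z, target, lr=0.1):
        """One gradient step on a squared-error loss; returns the MSE before the update."""
        pred = self(z)
        grad = z.T @ (pred - target) / len(z)  # gradient averaged over samples
        self.W -= lr * grad
        return float(np.mean((pred - target) ** 2))


# Illustrative per-pixel inputs; the channel counts are assumptions.
s1 = rng.normal(size=(256, 2))    # e.g. two radar channels
s2 = rng.normal(size=(256, 10))   # e.g. ten optical bands

# Stage 1: each sensor has its own encoder (here simply fixed random weights).
enc_s1 = FrozenEncoder(2, 8, rng)
enc_s2 = FrozenEncoder(10, 8, rng)

# Stage 2: concatenate the embeddings and train only the fusion layer.
z = np.concatenate([enc_s1(s1), enc_s2(s2)], axis=1)
fusion = FusionLayer(z.shape[1], 4, rng)
target = rng.normal(size=(256, 4))  # synthetic placeholder, not a real objective

w1_before = enc_s1.W.copy()
losses = [fusion.sgd_step(z, target) for _ in range(50)]
fused = fusion(z)
```

In this sketch, adding a further sensor would mean adding another frozen encoder and widening the fusion input, while the existing encoder weights stay untouched. That mirrors the extension path the abstract describes, though the real fusion model is presumably far richer than a single linear layer.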
Related papers
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion [64.49056527678606]
We propose a Token-wise Attention mechanism integrated into not only the U-Net diffusion model but also the radar-temporal encoder. Unlike prior approaches, our method integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion. Our experiments and evaluations demonstrate that the proposed method significantly outperforms state-of-the-art approaches in robustness, local fidelity, and generalization, and is superior in complex precipitation forecasting scenarios.
arXiv Detail & Related papers (2025-10-16T17:59:13Z)
- DFYP: A Dynamic Fusion Framework with Spectral Channel Attention and Adaptive Operator learning for Crop Yield Prediction [18.24061967822792]
DFYP is a novel Dynamic Fusion framework for crop Yield Prediction. It combines spectral channel attention, edge-adaptive spatial modeling and a learnable fusion mechanism. DFYP consistently outperforms current state-of-the-art baselines in RMSE, MAE, and R2.
arXiv Detail & Related papers (2025-07-08T10:24:04Z)
- Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models [56.2236083600999]
We propose a novel hierarchical input-dependent state space model for surgical video analysis. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-06-26T14:43:57Z)
- Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z)
- UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting [7.637123047745445]
Self-supervised methods are increasingly adapted to learn spatial-temporal representations. Current value reconstruction and future value prediction are integrated into the pre-training framework. Multi-time scale analysis is incorporated into the self-supervised loss to enhance predictive capability.
arXiv Detail & Related papers (2024-12-19T05:33:55Z)
- DynST: Dynamic Sparse Training for Resource-Constrained Spatio-Temporal Forecasting [31.398965880415492]
Earth science systems rely heavily on the extensive deployment of sensors. Traditional approaches to sensor deployment utilize specific algorithms to design and deploy sensors. We introduce for the first time the concept of dynamic sparse training for spatio-temporal data and aim to adaptively and dynamically filter important sensor distributions.
arXiv Detail & Related papers (2024-03-05T12:31:24Z)
- FREE: The Foundational Semantic Recognition for Modeling Environmental Ecosystems [56.0640340392818]
We introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines.
arXiv Detail & Related papers (2023-11-17T00:53:09Z)
- A spatio-temporal LSTM model to forecast across multiple temporal and spatial scales [0.0]
This paper presents a novel spatio-temporal LSTM (SPATIAL) architecture for time series forecasting applied to environmental datasets.
The framework was evaluated across multiple sensors and for three different oceanic variables: current speed, temperature, and dissolved oxygen.
arXiv Detail & Related papers (2021-08-26T16:07:13Z)