Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
- URL: http://arxiv.org/abs/2403.15356v3
- Date: Wed, 15 Oct 2025 22:11:23 GMT
- Title: Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
- Authors: Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, Xiao Xiang Zhu
- Abstract summary: We propose a unified, multimodal foundation framework designed for diverse vision tasks in Earth observation (EO). Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. We show DOFA's potential as a foundation for general-purpose vision models in the sensor-diverse EO domain.
- Score: 47.52225194259896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Earth observation (EO) in open-world settings presents a unique challenge: different applications rely on diverse sensor modalities, each with varying ground sampling distances, spectral ranges, and numbers of spectral bands. However, existing EO foundation models are typically tailored to specific sensor types, making them inflexible when generalizing across the heterogeneous landscape of EO data. To address this, we propose the Dynamic One-For-All (DOFA) model, a unified, multimodal foundation framework designed for diverse vision tasks in EO. Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. By continually pretraining on five EO modalities, DOFA achieves state-of-the-art performance across multiple downstream tasks and generalizes well to unseen modalities. Enhanced with hybrid continual pretraining, DOFA+ requires significantly fewer computational resources while outperforming counterparts trained with extensive GPU budgets. Experiments on diverse datasets highlight DOFA's potential as a foundation for general-purpose vision models in the sensor-diverse EO domain. The code and pre-trained weights are publicly available at https://github.com/zhu-xlab/DOFA.
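The core mechanism described above, a hypernetwork that generates input-embedding weights conditioned on each band's central wavelength, can be illustrated with a minimal sketch. This is a hypothetical toy illustration, not the released DOFA implementation: the dimensions, the sinusoidal wavelength encoding, and the mean aggregation over bands are all assumptions chosen for brevity.

```python
import numpy as np

# Toy wavelength-conditioned hypernetwork: a tiny MLP maps each band's
# central wavelength to that band's patch-embedding weights, so one model
# can ingest sensors with arbitrary numbers and positions of spectral bands.

rng = np.random.default_rng(0)
PATCH, EMBED, HID = 4, 8, 16  # patch size, embedding dim, hidden width

# Hypernetwork parameters, shared across all sensors and bands.
W1 = rng.normal(0, 0.1, (2, HID))                       # wavelength encoding -> hidden
W2 = rng.normal(0, 0.1, (HID, PATCH * PATCH * EMBED))   # hidden -> patch-embed weights

def band_weights(wavelength_um: float) -> np.ndarray:
    """Generate per-band patch-embedding weights from its central wavelength."""
    enc = np.array([np.sin(wavelength_um), np.cos(wavelength_um)])  # periodic encoding
    h = np.tanh(enc @ W1)
    return (h @ W2).reshape(PATCH * PATCH, EMBED)

def embed_patch(patch: np.ndarray, wavelengths: list) -> np.ndarray:
    """Embed a (num_bands, PATCH, PATCH) patch; num_bands may vary per sensor."""
    tokens = [patch[b].reshape(-1) @ band_weights(wl)
              for b, wl in enumerate(wavelengths)]
    return np.mean(tokens, axis=0)  # aggregate band tokens into one patch token

# The same parameters handle a 4-band and a 6-band sensor without change
# (wavelengths in micrometers, roughly Sentinel-2-like for illustration).
rgb_nir = embed_patch(rng.normal(size=(4, PATCH, PATCH)),
                      [0.49, 0.56, 0.665, 0.842])
six_band = embed_patch(rng.normal(size=(6, PATCH, PATCH)),
                       [0.49, 0.56, 0.665, 0.705, 0.842, 1.61])
print(rgb_nir.shape, six_band.shape)  # both (EMBED,)
```

Because the embedding weights are generated from the wavelengths rather than stored per sensor, the number of input bands never has to be fixed at training time, which is what allows a single backbone to serve heterogeneous satellite sensors.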
Related papers
- EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data [19.18955300820542]
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the Earth observation domain.
arXiv Detail & Related papers (2026-02-12T17:09:14Z) - DeFM: Learning Foundation Representations from Depth for Robotics [49.77188649197404]
We present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. It achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.
arXiv Detail & Related papers (2026-01-26T19:45:31Z) - The View From Space: Navigating Instrumentation Differences with EOFMs [0.0]
Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high-dimensional data for generic tasks such as similarity search and content-specific queries. Most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations.
arXiv Detail & Related papers (2025-10-01T00:53:45Z) - Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition [0.0]
We investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance. The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models. The study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models.
arXiv Detail & Related papers (2025-06-25T07:02:42Z) - EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM [103.7537991413311]
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z) - Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation [67.23953699167274]
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO).
In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery.
We propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance.
arXiv Detail & Related papers (2025-04-09T15:13:26Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation [27.878058177228727]
We present DOFA-CLIP, a vision-language foundation model that adapts to EO modalities with flexible spectral configurations through a single Transformer backbone. Our approach introduces three key contributions: 1) the construction of GeoLangBind-2M, a large-scale EO image-text dataset covering six heterogeneous modalities with rich natural language descriptions; 2) a novel training strategy called VECT, which enhances the spatial awareness of CLIP features with multiple vision foundation models; and 3) a Modality-aware Knowledge Agglomeration (MaKA) module that refines feature distillation with modality-specific awareness.
arXiv Detail & Related papers (2025-03-08T19:10:04Z) - Trajectory World Models for Heterogeneous Environments [67.27233466954814]
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models.
We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity.
We propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context.
arXiv Detail & Related papers (2025-02-03T13:59:08Z) - Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT models trained on ReaLS can achieve a 15% improvement in the FID metric. The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z) - PhysAug: A Physical-guided and Frequency-based Data Augmentation for Single-Domain Generalized Object Detection [4.592579302639643]
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain for robust performance across a variety of unseen target domains. Existing S-DGOD approaches often rely on data augmentation strategies, including compositions of visual transformations, to enhance the detector's generalization ability. We propose PhysAug, a novel physical model-based data augmentation method for non-ideal imaging conditions, to enhance generalization in S-DGOD tasks.
arXiv Detail & Related papers (2024-12-16T14:18:01Z) - On Foundation Models for Dynamical Systems from Purely Synthetic Data [5.004576576202551]
Foundation models have demonstrated remarkable generalization, data efficiency, and robustness properties across various domains.
These models are available in fields like natural language processing and computer vision, but do not exist for dynamical systems.
We address this challenge by pretraining a transformer-based foundation model exclusively on synthetic data.
Our results demonstrate the feasibility of foundation models for dynamical systems that outperform specialist models in terms of generalization, data efficiency, and robustness.
arXiv Detail & Related papers (2024-11-30T08:34:10Z) - Foundation Models for Remote Sensing and Earth Observation: A Survey [101.77425018347557]
This survey systematically reviews the emerging field of Remote Sensing Foundation Models (RSFMs).
It begins with an outline of their motivation and background, followed by an introduction of their foundational concepts.
We benchmark these models against publicly available datasets, discuss existing challenges, and propose future research directions.
arXiv Detail & Related papers (2024-10-22T01:08:21Z) - Multimodal Flare Forecasting with Deep Learning [0.2968738145616401]
We employ deep learning to compare the predictive capabilities of chromospheric and coronal UV and EUV emissions across different wavelengths.
Our findings indicate that individual EUV wavelengths can provide discriminatory power comparable or superior to that of line-of-sight magnetograms.
arXiv Detail & Related papers (2024-10-21T15:42:47Z) - Back to Bayesics: Uncovering Human Mobility Distributions and Anomalies with an Integrated Statistical and Neural Framework [14.899157568336731]
DeepBayesic is a novel framework that integrates Bayesian principles with deep neural networks to model the underlying distributions.
We evaluate our approach on several mobility datasets, demonstrating significant improvements over state-of-the-art anomaly detection methods.
arXiv Detail & Related papers (2024-10-01T19:02:06Z) - SpectralEarth: Training Hyperspectral Foundation Models at Scale [47.93167977587301]
We introduce SpectralEarth, a large-scale multi-temporal dataset designed to pretrain hyperspectral foundation models.
We pretrain a series of foundation models on SpectralEarth using state-of-the-art self-supervised learning (SSL) algorithms.
We construct four downstream datasets for land-cover and crop-type mapping, providing benchmarks for model evaluation.
arXiv Detail & Related papers (2024-08-15T22:55:59Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Synthetic location trajectory generation using categorical diffusion
models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z) - Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for
Enhanced Human Pose Estimation with Sparse Inertial Sensors [17.3834029178939]
This paper introduces a novel human pose estimation approach using sparse inertial sensors.
It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization.
The approach demonstrates superior performance over state-of-the-art models across five public datasets, notably reducing pose error by 19% on the DIP-IMU dataset.
arXiv Detail & Related papers (2023-12-02T13:17:10Z) - Foundation Models for Generalist Geospatial Artificial Intelligence [3.7002058945990415]
This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive data.
We have utilized this framework to create Prithvi, a transformer-based foundational model pre-trained on more than 1TB of multispectral satellite imagery.
arXiv Detail & Related papers (2023-10-28T10:19:55Z) - VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator implies that the latent space shows an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.