SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
- URL: http://arxiv.org/abs/2502.00435v1
- Date: Sat, 01 Feb 2025 14:07:21 GMT
- Title: SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
- Authors: Chuc Man Duc, Hiromichi Fukui,
- Abstract summary: Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms.
Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images.
This research proposes SatMamba, a new pretraining framework that combines masked autoencoders with State Space Model.
- Score: 0.0
- License:
- Abstract: Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While the models showed promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available in https://github.com/mdchuc/HRSFM.
Related papers
- SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery [8.096413986108601]
We introduce SatVision-TOA, a novel foundation model pre-trained on 14-band MODIS L1B Top-Of-Atmosphere (TOA) radiance imagery.
The SatVision-TOA model is pre-trained using a Masked-Image-Modeling (MIM) framework and the SwinV2 architecture.
Results show that SatVision-TOA achieves superior performance over baseline methods on downstream tasks.
arXiv Detail & Related papers (2024-11-26T00:08:00Z) - A Mamba Foundation Model for Time Series Forecasting [13.593170999506889]
We introduce TSMamba, a linear-complexity foundation model for time series forecasting built on the Mamba architecture.
The model captures temporal dependencies through both forward and backward Mamba encoders, achieving high prediction accuracy.
It also achieves competitive or superior full-shot performance compared to task-specific prediction models.
arXiv Detail & Related papers (2024-11-05T09:34:05Z) - Seeing Through the Clouds: Cloud Gap Imputation with Prithvi Foundation Model [1.2374541748245838]
We compare a Vision Transformer (ViT) model with a baseline Conditional Generative Adversarial Network (CGAN) model for missing value imputation in time series of multispectral satellite imagery.
We randomly mask time series of satellite images using real-world cloud masks and train each model to reconstruct the missing pixels.
arXiv Detail & Related papers (2024-04-30T15:03:27Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data.
In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery [35.550999964460466]
We present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing dataset with 21.5 million temporal sequences.
To our best knowledge, SkySense is the largest Multi-Modal to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
arXiv Detail & Related papers (2023-12-15T09:57:21Z) - DiffusionSat: A Generative Foundation Model for Satellite Imagery [63.2807119794691]
We present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets.
Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting.
arXiv Detail & Related papers (2023-12-06T16:53:17Z) - Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.