Related papers: Spatio-Temporal SwinMAE: A Swin Transformer based Multiscale Representation Learner for Temporal Satellite Imagery

URL: http://arxiv.org/abs/2405.02512v1
Date: Fri, 3 May 2024 22:55:56 GMT
Title: Spatio-Temporal SwinMAE: A Swin Transformer based Multiscale Representation Learner for Temporal Satellite Imagery
Authors: Yohei Nakayama, Jiawei Su,
Abstract summary: This paper presents Spatio-Temporal SwinMAE (ST-SwinMAE), an architecture that focuses on representation learning for temporal image processing. We present a pretrained model named Degas 100M as a geospatial foundation model. We also propose a transfer learning approach with Degas 100M in which both the pretrained encoder and decoder of the MAE are utilized. Our approach shows significant performance improvements over existing state-of-the-art foundation models.
Score: 1.8185814461140652
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation models, represented by large language models, have made dramatic progress and are now used in a very wide range of domains, including 2D and 3D vision. Earth observation is one important application domain of foundation models; it has attracted attention, and various approaches have been developed. A single image capture can be processed as an image with three or more channels, whereas multiple captures of one location at different timestamps can be treated as a sequence of images resembling video frames or medical scan slices. This paper presents Spatio-Temporal SwinMAE (ST-SwinMAE), an architecture that focuses on representation learning for spatio-temporal image processing. Specifically, it uses a hierarchical Masked Auto-encoder (MAE) with Video Swin Transformer blocks. With this architecture, we present a pretrained model named Degas 100M as a geospatial foundation model. We also propose a transfer learning approach with Degas 100M in which both the pretrained encoder and decoder of the MAE are utilized, with skip connections added between them to achieve multi-scale information communication; this forms an architecture named Spatio-Temporal SwinUNet (ST-SwinUNet). Our approach shows significant performance improvements over existing state-of-the-art foundation models. Specifically, for transfer learning on the land cover downstream task of the PhilEO Bench dataset, it shows 10.4% higher accuracy on average compared with other geospatial foundation models.
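
Illustrative sketch (not taken from the paper): the masked auto-encoder idea mentioned in the abstract can be pictured as splitting a stack of T captures of size H x W into space-time patches and hiding a random subset of them from the encoder. The patch size, the 0.75 mask ratio, and the function name below are assumptions chosen for illustration only; the abstract does not state the values used for Degas 100M, and the hierarchical Swin stages and the decoder are omitted entirely.

```python
import random


def mask_spacetime_patches(t, h, w, patch=(1, 4, 4), mask_ratio=0.75, seed=0):
    """Pick which space-time patches of a (t, h, w) stack are hidden.

    Returns (visible, masked): two sorted lists of flat patch indices.
    All names and default values are hypothetical, not the paper's settings.
    """
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("stack dimensions must be divisible by the patch size")

    # Number of patches along time, height and width, flattened to one index.
    n_patches = (t // pt) * (h // ph) * (w // pw)
    n_masked = int(n_patches * mask_ratio)

    # Shuffle the patch indices and hide the first n_masked of them.
    rng = random.Random(seed)
    order = list(range(n_patches))
    rng.shuffle(order)

    masked = sorted(order[:n_masked])
    visible = sorted(order[n_masked:])
    return visible, masked


# Three captures of a 16 x 16 tile give 3 * 4 * 4 = 48 patches.
visible, masked = mask_spacetime_patches(t=3, h=16, w=16)
print(len(visible), len(masked))
```

In an MAE-style setup, only the visible patches would be passed to the encoder, and the decoder would be trained to reconstruct the masked ones.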

Related papers

Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery [3.3964392722361785]
A large majority of machine learning models trained on satellite imagery (SatML) are designed primarily for optical input modalities such as multi-spectral satellite imagery. We generate augmented versions of SatML benchmark tasks by appending additional geographic data layers to datasets spanning classification, regression, and segmentation. We find that fusing additional geographic inputs with optical imagery can significantly improve SatML model performance.
arXiv Detail & Related papers (2025-07-15T22:57:29Z)
AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping [11.187551725609099]
Transformer-based remote sensing foundation models (RSFMs) offer potential for crop mapping due to their ability for unified processing. We present AgriFM, a multi-temporal remote sensing foundation model specifically designed for agricultural crop mapping.
arXiv Detail & Related papers (2025-05-27T15:50:14Z)
TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series [39.22426645737932]
TiMo is a novel hierarchical vision transformer foundation model tailored for satellite image time series (SITS) analysis. At its core, we introduce a spatiotemporal attention mechanism that dynamically captures multiscale patterns across both time and space. Extensive experiments across multiple spatiotemporal tasks, including deforestation monitoring, demonstrate TiMo's superiority over state-of-the-art methods.
arXiv Detail & Related papers (2025-05-13T16:35:11Z)
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data [14.104497777255137]
We introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models.
arXiv Detail & Related papers (2025-03-17T05:42:19Z)
SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models [0.0]
Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images. This research proposes SatMamba, a new pretraining framework that combines masked autoencoders with a State Space Model.
arXiv Detail & Related papers (2025-02-01T14:07:21Z)
EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z)
Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data [0.08192907805418582]
This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE).
arXiv Detail & Related papers (2024-10-01T07:50:37Z)
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning [9.540487697801531]
MMEarth is a diverse multi-modal pretraining dataset at global scale. We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
arXiv Detail & Related papers (2024-05-04T23:16:48Z)
SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery [35.550999964460466]
We present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing dataset with 21.5 million temporal sequences. To the best of our knowledge, SkySense is the largest multi-modal remote sensing foundation model to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
arXiv Detail & Related papers (2023-12-15T09:57:21Z)
PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
Semantic Segmentation of Vegetation in Remote Sensing Imagery Using Deep Learning [77.34726150561087]
We propose an approach for creating a multi-modal and large-temporal dataset comprised of publicly available Remote Sensing data. We use Convolutional Neural Networks (CNN) models that are capable of separating different classes of vegetation.
arXiv Detail & Related papers (2022-09-28T18:51:59Z)
SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
Investigating Temporal Convolutional Neural Networks for Satellite Image Time Series Classification: A survey [0.0]
Temporal CNNs have been employed for SITS classification tasks with encouraging results. This paper seeks to survey this method against a plethora of other contemporary methods for SITS classification to validate the existing findings in recent literature. Experiments are carried out on two benchmark SITS datasets with the results demonstrating that Temporal CNNs display a superior performance to the comparative benchmark algorithms.
arXiv Detail & Related papers (2022-04-13T14:08:14Z)
Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes [98.65457534223539]
We propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes. The proposed method achieves the accuracy of 73.6% and 68.0% mean Intersection over Union (mIoU) with the inference speed of 51.0 fps and 39.3 fps.
arXiv Detail & Related papers (2020-03-11T08:45:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.