SatMAE: Pre-training Transformers for Temporal and Multi-Spectral
Satellite Imagery
- URL: http://arxiv.org/abs/2207.08051v1
- Date: Sun, 17 Jul 2022 01:35:29 GMT
- Title: SatMAE: Pre-training Transformers for Temporal and Multi-Spectral
Satellite Imagery
- Authors: Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi,
Yutong He, Marshall Burke, David B. Lobell, Stefano Ermon
- Abstract summary: We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
- Score: 74.82821342249039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised pre-training methods for large vision models have been shown to
enhance performance on downstream supervised tasks. Developing similar
techniques for satellite imagery presents significant opportunities as
unlabelled data is plentiful and the inherent temporal and multi-spectral
structure provides avenues to further improve existing pre-training strategies.
In this paper, we present SatMAE, a pre-training framework for temporal or
multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage
temporal information, we include a temporal embedding along with independently
masking image patches across time. In addition, we demonstrate that encoding
multi-spectral data as groups of bands with distinct spectral positional
encodings is beneficial. Our approach yields strong improvements over previous
state-of-the-art techniques, both in terms of supervised learning performance
on benchmark datasets (up to $\uparrow$ 7\%), and transfer learning performance
on downstream remote sensing tasks, including land cover classification (up to
$\uparrow$ 14\%) and semantic segmentation.
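The two mechanisms the abstract describes, masking patches independently at each timestep and giving each group of spectral bands its own positional code, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the band grouping, the sinusoidal parameterization of the temporal embedding, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches_independently(num_timesteps, num_patches, mask_ratio=0.75):
    """For each timestep, keep a random subset of patch indices.

    Unlike repeating one mask at every date, each timestep draws its own
    mask, so a patch hidden at one date may be visible at another."""
    num_keep = int(num_patches * (1 - mask_ratio))
    return [rng.permutation(num_patches)[:num_keep] for _ in range(num_timesteps)]

def temporal_embedding(timestamps, dim):
    """Sinusoidal embedding of acquisition time, added to patch tokens."""
    t = np.asarray(timestamps, dtype=np.float64)[:, None]      # (T, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = t * freqs                                         # (T, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (T, dim)

def spectral_group_encoding(groups, dim):
    """One distinct code per band group, shared by all bands in the group."""
    enc = {}
    for i, name in enumerate(groups):
        code = np.zeros(dim)
        code[i] = 1.0          # stand-in for a learned spectral positional code
        enc[name] = code
    return enc

# Hypothetical Sentinel-2-style grouping of spectrally similar bands
groups = {"rgb": [3, 2, 1], "red_edge_nir": [4, 5, 6, 7], "swir": [10, 11]}
keep_idx = mask_patches_independently(num_timesteps=3, num_patches=196)
time_emb = temporal_embedding([0.0, 30.0, 60.0], dim=16)   # days since first image
group_emb = spectral_group_encoding(groups, dim=16)
```

In a full model, `time_emb` would be added to every patch token from the corresponding date, and `group_emb` to every token from the corresponding band group, before the MAE encoder runs on the kept patches.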
Related papers
- SatSwinMAE: Efficient Autoencoding for Multiscale Time-series Satellite Imagery [1.6180992915701702]
We extend the SwinMAE model to integrate temporal information for satellite time-series data.
The architecture employs a hierarchical 3D Masked Autoencoder (MAE) with Video Swin Transformer blocks.
Our approach shows significant performance improvements over existing state-of-the-art foundation models for all the evaluated downstream tasks.
arXiv Detail & Related papers (2024-05-03T22:55:56Z)
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision [1.4127889233510498]
A novel approach is proposed to stratify landscapes based on mobility activity time series.
The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling.
arXiv Detail & Related papers (2023-10-16T02:53:29Z)
- Learning Semantic Segmentation with Query Points Supervision on Aerial Images [57.09251327650334]
We present a weakly supervised learning algorithm to train semantic segmentation models.
Our proposed approach performs accurate semantic segmentation and improves efficiency by significantly reducing the cost and time required for manual annotation.
arXiv Detail & Related papers (2023-09-11T14:32:04Z)
- Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
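The space-time patch splitting described here can be illustrated with a shape-only sketch. The patch size, band count, and function name are hypothetical; the real model adds linear projections, learned tokens, and the attention layers of the factorized encoder.

```python
import numpy as np

def tokenize_sits(record, patch=4):
    """Split a SITS record (T, H, W, C) into non-overlapping patch tokens.

    Returns (T, N, patch*patch*C), where N is the number of spatial patches,
    so a factorized encoder can attend over the T dates at each spatial
    location, then over the N locations."""
    T, H, W, C = record.shape
    assert H % patch == 0 and W % patch == 0
    x = record.reshape(T, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, H/p, W/p, p, p, C)
    return x.reshape(T, (H // patch) * (W // patch), patch * patch * C)

record = np.zeros((6, 24, 24, 10))               # 6 dates, 24x24 pixels, 10 bands
tok = tokenize_sits(record)                      # (6, 36, 160)
# The temporal sub-encoder sees each spatial token's sequence of 6 dates:
temporal_view = tok.transpose(1, 0, 2)           # (36, 6, 160)
```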
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Towards On-Board Panoptic Segmentation of Multispectral Satellite Images [41.34294145237618]
We propose a lightweight pipeline for on-board panoptic segmentation of multi-spectral satellite images.
Panoptic segmentation offers major economic and environmental insights, ranging from yield estimation from agricultural lands to intelligence for complex military applications.
Our evaluations demonstrate a substantial increase in accuracy metrics compared to the existing state-of-the-art models.
arXiv Detail & Related papers (2022-04-05T03:10:39Z)
- Multi-Modal Temporal Attention Models for Crop Mapping from Satellite Time Series [7.379078963413671]
Motivated by the recent success of temporal attention-based methods across multiple crop mapping tasks, we propose to investigate how these models can be adapted to operate on several modalities.
We implement and evaluate multiple fusion schemes, including a novel approach and simple adjustments to the training procedure.
We show that most fusion schemes have advantages and drawbacks, making them relevant for specific settings.
We then evaluate the benefit of multimodality across several tasks: parcel classification, pixel-based segmentation, and panoptic parcel segmentation.
arXiv Detail & Related papers (2021-12-14T17:05:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.