Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery
- URL: http://arxiv.org/abs/2403.05419v1
- Date: Fri, 8 Mar 2024 16:18:04 GMT
- Title: Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery
- Authors: Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwar,
Salman Khan, Fahad Shahbaz Khan
- Abstract summary: Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of a large amount of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
- Score: 78.43828998065071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in unsupervised learning have demonstrated the ability of
large vision models to achieve promising results on downstream tasks by
pre-training on a large amount of unlabelled data. Such pre-training techniques
have also been explored recently in the remote sensing domain due to the
availability of a large amount of unlabelled data. Unlike standard
natural image datasets, remote sensing data is acquired from various sensor
technologies and exhibits a diverse range of scale variations as well as
modalities. Existing satellite image pre-training methods either ignore the
scale information present in remote sensing imagery or restrict themselves
to a single type of data modality. In this paper, we revisit
transformer pre-training and leverage multi-scale information that is
effectively utilized with multiple modalities. Our proposed approach, named
SatMAE++, performs multi-scale pre-training and utilizes convolution-based
upsampling blocks to reconstruct the image at higher scales, making it
extensible to include more scales. Compared to existing works, the proposed
SatMAE++ with multi-scale pre-training is equally effective for both optical
and multi-spectral imagery. Extensive experiments on six datasets reveal
the merits of the proposed contributions, leading to state-of-the-art performance
on all datasets. SatMAE++ achieves a mean average precision (mAP) gain of 2.5%
for the multi-label classification task on the BigEarthNet dataset. Our code and
pre-trained models are available at https://github.com/techmn/satmae_pp.
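The abstract itself contains no code; the authors' implementation is in the repository linked above. As a rough illustration of the stated mechanism (convolution-based upsampling blocks that reconstruct the image at progressively higher scales, so that adding blocks adds scales), a minimal PyTorch sketch might look as follows. All class, parameter, and shape choices here are assumptions, not the actual SatMAE++ design.

```python
# Hypothetical sketch of a multi-scale reconstruction head with
# convolution-based upsampling blocks (illustrative, not the authors' code).
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Doubles spatial resolution: transposed conv plus a refinement conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.refine(self.up(x))

class MultiScaleHead(nn.Module):
    """Maps decoded MAE patch tokens to reconstructions at 1x, 2x, and 4x
    the base patch-grid resolution. tokens: (B, N, D) with N = grid**2."""
    def __init__(self, dim, grid, out_ch=3):
        super().__init__()
        self.grid = grid
        self.base = nn.Conv2d(dim, 64, kernel_size=1)
        self.up1 = UpsampleBlock(64, 32)  # 2x; stack more blocks for more scales
        self.up2 = UpsampleBlock(32, 16)  # 4x
        self.heads = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=1) for c in (64, 32, 16)
        )

    def forward(self, tokens):
        B, N, D = tokens.shape
        # Reshape the token sequence back into a 2D feature map.
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        f0 = self.base(x)   # base scale
        f1 = self.up1(f0)   # 2x scale
        f2 = self.up2(f1)   # 4x scale
        return [h(f) for h, f in zip(self.heads, (f0, f1, f2))]
```

Training would then compare each output against the target image resized to the matching resolution (for an MAE, typically on masked patches only); stacking further UpsampleBlocks extends the head to additional scales, consistent with the abstract's extensibility claim.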
Related papers
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning [9.540487697801531]
MMEarth is a diverse multi-modal pretraining dataset at global scale.
We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
arXiv Detail & Related papers (2024-05-04T23:16:48Z)
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
- SynDrone -- Multi-modal UAV Dataset for Urban Scenarios [11.338399194998933]
The scarcity of large-scale real datasets with pixel-level annotations poses a significant challenge to researchers.
We propose a multimodal synthetic dataset containing both images and 3D data taken at multiple flying heights.
The dataset will be made publicly available to support the development of novel computer vision methods targeting UAV applications.
arXiv Detail & Related papers (2023-08-21T06:22:10Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how performance on downstream tasks changes when scaling the sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Self-Supervised In-Domain Representation Learning for Remote Sensing Image Scene Classification [1.0152838128195465]
Transferring ImageNet pre-trained weights to various remote sensing tasks has produced acceptable results.
Recent research has demonstrated that self-supervised learning methods capture visual features that are more discriminative and transferable.
Motivated by these facts, we pre-train in-domain representations of remote sensing imagery using contrastive self-supervised learning.
arXiv Detail & Related papers (2023-02-03T15:03:07Z)
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding and independently mask image patches across time (a hypothetical sketch of these two ingredients follows this list).
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner by exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Deep Multimodal Transfer-Learned Regression in Data-Poor Domains [0.0]
We propose a Deep Multimodal Transfer-Learned Regressor (DMTL-R) for multimodal learning of image and feature data.
Our model is capable of fine-tuning a given set of pre-trained CNN weights on a small amount of training image data.
We present results using phase-field simulation microstructure images with an accompanying set of physical features, using pre-trained weights from various well-known CNN architectures.
arXiv Detail & Related papers (2020-06-16T16:52:44Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the resulting dataset manageable, we apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
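The SatMAE entry above names two concrete ingredients: a per-timestep temporal embedding and masking image patches independently across time. The sketch below is a hypothetical PyTorch reading of those two ideas only; the function name, tensor shapes, and mask-ratio default are assumptions, not SatMAE's actual code.

```python
# Hypothetical sketch: add a temporal embedding per timestep, then mask
# patches independently at each timestep (illustrative, not the paper's code).
import torch

def mask_patches_across_time(tokens, temporal_embed, mask_ratio=0.75):
    """tokens: (B, T, N, D) patch embeddings for T timesteps.
    temporal_embed: (T, D) learned embedding, one vector per timestep.
    Returns the visible tokens (B, T, K, D) and the kept indices (B, T, K)."""
    B, T, N, D = tokens.shape
    # Add the temporal embedding (broadcast over batch and patch dims).
    tokens = tokens + temporal_embed[None, :, None, :]

    K = int(N * (1.0 - mask_ratio))  # visible patches per timestep
    # Fresh random scores for every (batch, timestep) pair, so each
    # timestep is masked independently of the others.
    scores = torch.rand(B, T, N, device=tokens.device)
    keep = scores.argsort(dim=-1)[..., :K]  # (B, T, K) kept patch indices
    visible = torch.gather(tokens, 2, keep.unsqueeze(-1).expand(B, T, K, D))
    return visible, keep
```

Because the random scores are drawn separately for every (batch, timestep) pair, each timestep keeps a different subset of patch locations, which is one plausible reading of "independently masking image patches across time".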
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.