Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
- URL: http://arxiv.org/abs/2401.15855v1
- Date: Mon, 29 Jan 2024 03:06:19 GMT
- Title: Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
- Authors: Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, Hairong Qi
- Abstract summary: We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses.
Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
- Score: 5.325585142755542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote sensing images present unique challenges to image analysis due to the
extensive geographic coverage, hardware limitations, and misaligned multi-scale
images. This paper revisits the classical multi-scale representation learning
problem but under the general framework of self-supervised learning for remote
sensing image understanding. We present Cross-Scale MAE, a self-supervised
model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale
MAE employs scale augmentation techniques and enforces cross-scale consistency
constraints through both contrastive and generative losses to ensure consistent
and meaningful representations well-suited for a wide range of downstream
tasks. Further, our implementation leverages the xFormers library to accelerate
network pre-training on a single GPU while maintaining the quality of learned
representations. Experimental evaluations demonstrate that Cross-Scale MAE
exhibits superior performance compared to standard MAE and other
state-of-the-art remote sensing MAE methods.
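The abstract's pairing of a contrastive and a generative objective can be sketched as a combined loss over two scale-augmented views. This is an illustrative NumPy sketch, not the paper's implementation: the InfoNCE form, the MSE reconstruction term, the temperature, and the weight `w` are all assumptions for exposition.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Contrastive (InfoNCE-style) loss aligning embeddings of two scale views.
    Matching rows of z_a and z_b are positives; all other rows are negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # L2-normalize
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives on the diagonal

def cross_scale_loss(z_lo, z_hi, recon_lo, recon_hi,
                     target_lo, target_hi, w=0.5):
    """Combine a contrastive term (cross-scale consistency of representations)
    with a generative term (masked-patch reconstruction at each scale)."""
    contrastive = info_nce(z_lo, z_hi)
    generative = (np.mean((recon_lo - target_lo) ** 2)
                  + np.mean((recon_hi - target_hi) ** 2))
    return w * contrastive + (1 - w) * generative
```

In this sketch, driving the contrastive term down pulls representations of the same scene at different scales together, while the reconstruction term keeps the decoder honest at each scale individually.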
Related papers
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs)
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data.
In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing [16.683132793313693]
Masked AutoEncoder (MAE) has attracted wide attention for pretraining vision transformers in remote sensing.
We propose Feature Guided Masked Autoencoder (FG-MAE): reconstructing a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and reconstructing HOG for SAR images.
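The NDI targets mentioned above follow the standard normalized-difference formula; a minimal sketch (the band pairing, e.g. NIR/Red for NDVI, and the `eps` stabilizer are illustrative choices, not FG-MAE's exact configuration):

```python
import numpy as np

def normalized_difference_index(band_a, band_b, eps=1e-8):
    """Generic Normalized Difference Index: (a - b) / (a + b).
    E.g. NDVI = NDI(NIR, Red); output lies in [-1, 1]."""
    return (band_a - band_b) / (band_a + band_b + eps)
```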
arXiv Detail & Related papers (2023-10-28T09:43:13Z)
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
- Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning [55.762840052788945]
We present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales.
We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery.
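The low/high-frequency reconstruction targets described above can be illustrated with a simple blur-based decomposition. This is not Scale-MAE's exact filter bank (the summary doesn't specify one); the box blur and residual split are assumptions chosen so that the two targets sum back to the original image.

```python
import numpy as np

def box_blur(img, k=3):
    """Simple box blur as a stand-in low-pass filter (illustrative only)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def frequency_targets(img):
    """Split an image into low- and high-frequency reconstruction targets.
    By construction, low + high recovers the input exactly."""
    low = box_blur(img)
    high = img - low   # residual carries edges / fine detail
    return low, high
```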
arXiv Detail & Related papers (2022-12-30T03:15:34Z)
- Multi-Spectral Image Classification with Ultra-Lean Complex-Valued Models [28.798100220715686]
Multi-spectral imagery is invaluable for remote sensing due to different spectral signatures exhibited by materials.
We apply complex-valued co-domain symmetric models to classify real-valued MSI images.
Our work is the first to demonstrate the value of complex-valued deep learning on real-valued MSI data.
arXiv Detail & Related papers (2022-11-21T19:01:53Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well, owing to propagating labels on an updatable graph constructed by high-level features on the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
arXiv Detail & Related papers (2020-06-24T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.