Remote Sensing Scene Classification with Masked Image Modeling (MIM)
- URL: http://arxiv.org/abs/2302.14256v2
- Date: Fri, 24 Mar 2023 17:43:20 GMT
- Title: Remote Sensing Scene Classification with Masked Image Modeling (MIM)
- Authors: Liya Wang, Alex Tien
- Abstract summary: Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to be a better way of learning visual feature representations.
This research aims to explore the potential of MIM pretrained backbones on four well-known classification datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing scene classification has been extensively studied for its
critical roles in geological survey, oil exploration, traffic management,
earthquake prediction, wildfire monitoring, and intelligence monitoring. In the
past, Machine Learning (ML) methods for this task mainly used backbones
pretrained in a supervised learning (SL) manner. As Masked Image
Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to
be a better way of learning visual feature representations, it presents a new
opportunity for improving ML performance on the scene classification task. This
research aims to explore the potential of MIM pretrained backbones on four
well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31.
Compared to the published benchmarks, we show that MIM-pretrained Vision
Transformer (ViT) backbones outperform other alternatives (by up to 18% in
top-1 accuracy) and that the MIM technique can learn better feature
representations than its supervised learning counterparts (by up to 5% in
top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs
can achieve performance competitive with the specially designed yet
complicated Transformer for Remote Sensing (TRS) framework. Our experimental
results also provide a performance baseline for future studies.
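As a rough illustration of the setup the abstract describes, the sketch below loads an MAE-style MIM-pretrained ViT-B/16 backbone and fine-tunes it end to end for scene classification. The timm checkpoint tag ("vit_base_patch16_224.mae"), the dataset path, the class count, and all hyperparameters are illustrative assumptions, not the authors' released configuration.

```python
# Hedged sketch: fine-tune an MIM (MAE) pretrained ViT backbone for remote
# sensing scene classification. The checkpoint tag, dataset path, and every
# hyperparameter below are assumptions for illustration, not the paper's recipe.
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES = 45  # e.g. NWPU-RESISC45 (Merced has 21, AID 30, Optimal-31 31)

# ViT-B/16 encoder pretrained with Masked Autoencoding; a fresh classification
# head is attached because the MIM checkpoint ships without a classifier.
model = timm.create_model("vit_base_patch16_224.mae", pretrained=True,
                          num_classes=NUM_CLASSES)

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
train_set = datasets.ImageFolder("data/NWPU-RESISC45/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# End-to-end fine-tuning: all backbone parameters are updated, one common
# protocol for comparing pretrained backbones by top-1 accuracy.
model.train()
for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```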
Related papers
- Rethinking Pre-trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification [2.6703221234079946]
Multiple instance learning (MIL) has become a preferred method for gigapixel whole slide image (WSI) classification without requiring patch-level annotations.
This study systematically evaluates MIL feature extractors across three dimensions: pre-training dataset, backbone model, and pre-training method.
Our findings reveal that selecting a robust self-supervised learning (SSL) method has a greater impact on performance than relying solely on an in-domain pre-training dataset.
arXiv Detail & Related papers (2024-08-02T10:34:23Z)
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, because pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information provide superior representational capability.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
- GenCo: An Auxiliary Generator from Contrastive Learning for Enhanced Few-Shot Learning in Remote Sensing [9.504503675097137]
We introduce a generator-based contrastive learning framework (GenCo) that pre-trains backbones and simultaneously explores variants of feature samples.
In fine-tuning, the auxiliary generator can be used to enrich limited labeled data samples in feature space.
We demonstrate the effectiveness of our method in improving few-shot learning performance on two key remote sensing datasets.
arXiv Detail & Related papers (2023-07-27T03:59:19Z)
- In-Domain Self-Supervised Learning Improves Remote Sensing Image Scene Classification [5.323049242720532]
Self-supervised learning has emerged as a promising approach for remote sensing image classification.
We present a study of different self-supervised pre-training strategies and evaluate their effect across 14 downstream datasets.
arXiv Detail & Related papers (2023-07-04T10:57:52Z)
- FreMIM: Fourier Transform Meets Masked Image Modeling for Medical Image Segmentation [37.465246717967595]
We present a new MIM-based framework named FreMIM for self-supervised pre-training to better accomplish medical image segmentation tasks.
Our FreMIM could consistently bring considerable improvements to model performance.
arXiv Detail & Related papers (2023-04-21T10:23:34Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches (a toy sketch of this objective appears after this list).
Applying the reconstruction supervision on the CLIP representation has proven effective for MIM.
To investigate strategies for refining CLIP-targeted MIM, we study two critical elements of MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- SLIP: Self-supervision meets Language-Image Pre-training [79.53764315471543]
We study whether self-supervised learning can aid in the use of language supervision for visual representation learning.
We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training.
We find that SLIP enjoys the best of both worlds: better performance than either self-supervision or language supervision alone.
arXiv Detail & Related papers (2021-12-23T18:07:13Z)
- Benchmarking Detection Transfer Learning with Vision Transformers [60.97703494764904]
The complexity of object detection methods can make benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive.
We present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN.
Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO.
arXiv Detail & Related papers (2021-11-22T18:59:15Z)
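Several of the entries above describe the core MIM mechanism itself: mask most image patches and train the network to reconstruct them. The toy sketch below illustrates an MAE-style version of that objective; the dimensions, the 0.75 mask ratio, and the use of patch embeddings as stand-ins for raw pixel targets are all assumptions made to keep the example short.

```python
# Toy, assumption-laden sketch of the MAE-style MIM objective: encode only the
# visible patches, decode with learned mask tokens, and compute the loss only
# on the masked positions. Dimensions and the 0.75 mask ratio are illustrative.
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    def __init__(self, num_patches=196, dim=192):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.head = nn.Linear(dim, dim)  # predicts the original patch content

    def forward(self, patches, mask_ratio=0.75):
        B, N, D = patches.shape
        x = patches + self.pos
        num_keep = int(N * (1 - mask_ratio))
        ids = torch.rand(B, N, device=x.device).argsort(dim=1)  # random patch order
        ids_keep, ids_mask = ids[:, :num_keep], ids[:, num_keep:]
        gather = lambda t, idx: torch.gather(
            t, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(gather(x, ids_keep))          # encode visible only
        mask_tok = self.mask_token.expand(B, N - num_keep, D) \
            + gather(self.pos.expand(B, -1, -1), ids_mask)   # masked positions
        pred = self.head(self.decoder(torch.cat([latent, mask_tok], dim=1)))
        target = gather(patches, torch.cat([ids_keep, ids_mask], dim=1))
        # Reconstruction loss only where patches were masked out.
        return ((pred[:, num_keep:] - target[:, num_keep:]) ** 2).mean()

# Patch embeddings stand in for raw pixel patches to keep the sketch short.
loss = TinyMIM()(torch.randn(4, 196, 192))
loss.backward()
```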