Related papers: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

URL: http://arxiv.org/abs/2406.11933v1
Date: Mon, 17 Jun 2024 15:41:57 GMT
Title: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset
Authors: Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Jing Zhang, Zhiyuan Liu, Maosong Sun,
Abstract summary: This study introduces textbfRS-4M, a large-scale dataset designed to enable highly efficient MIM training on RS images. We propose an efficient MIM method, termed textbfSelectiveMAE, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness. Experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.
Score: 66.15872913664407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly efficient MIM training on RS images. RS-4M comprises 4 million optical images encompassing abundant and fine-grained RS visual tasks, including object-level detection and pixel-level segmentation. Compared to natural images, RS images often contain massive redundant background pixels, which limits the training efficiency of the conventional MIM models. To address this, we propose an efficient MIM method, termed \textbf{SelectiveMAE}, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness. SelectiveMAE roots in a progressive semantic token selection module, which evolves from reconstructing semantically analogical tokens to encoding complementary semantic dependencies. This approach transforms conventional MIM training into a progressive feature learning process, enabling SelectiveMAE to efficiently learn robust representations of RS images. Extensive experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.The dataset, source code, and trained models will be released.

Related papers

Wavelet-Driven Masked Image Modeling: A Path to Efficient Visual Representation [27.576174611043367]
Masked Image Modeling (MIM) has garnered significant attention in self-supervised learning, thanks to its impressive capacity to learn scalable visual representations tailored for downstream tasks. However, images inherently contain abundant redundant information, leading the pixel-based MIM reconstruction process to focus excessively on finer details such as textures, thus prolonging training times unnecessarily. In this study, we leverage wavelet transform as a tool for efficient representation learning to expedite the training process of MIM.
arXiv Detail & Related papers (2025-03-02T08:11:26Z)
Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification [13.238769012534922]
We propose a novel Cross-Modal Mapping (CMM) method for few-shot image classification.<n>CMM aligns image features with the text feature space through linear transformation.<n>It improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets.
arXiv Detail & Related papers (2024-12-28T10:40:21Z)
Learning De-Biased Representations for Remote-Sensing Imagery [27.218010145001312]
Remote sensing (RS) imagery requires specialized satellites to collect and be difficult to annotate. Due to data scarcity, training any large-scale RS models from scratch is unrealistic. We propose debLoRA, a generic training approach that works with any LoRA variants to yield debiased features.
arXiv Detail & Related papers (2024-10-06T16:47:30Z)
Cross-Scan Mamba with Masked Training for Robust Spectral Imaging [51.557804095896174]
We propose the Cross-Scanning Mamba, named CS-Mamba, that employs a Spatial-Spectral SSM for global-local balanced context encoding. Experiment results show that our CS-Mamba achieves state-of-the-art performance and the masked training method can better reconstruct smooth features to improve the visual quality.
arXiv Detail & Related papers (2024-08-01T15:14:10Z)
Discriminative Hamiltonian Variational Autoencoder for Accurate Tumor Segmentation in Data-Scarce Regimes [2.8498944632323755]
We propose an end-to-end hybrid architecture for medical image segmentation. We use Hamiltonian Variational Autoencoders (HVAE) and a discriminative regularization to improve the quality of generated images. Our architecture operates on a slice-by-slice basis to segment 3D volumes, capitilizing on the richly augmented dataset.
arXiv Detail & Related papers (2024-06-17T15:42:08Z)
A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder [26.81539884309151]
Remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. We propose an Anchor-Aware Masked AutoEncoder method (A$2$-MAE), leveraging intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase.
arXiv Detail & Related papers (2024-06-12T11:02:15Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification [35.52272615695294]
We propose a spatial-spectral masked auto-encoder (SS-MAE) for HSI and LiDAR/SAR data joint classification. Our SS-MAE fully exploits the spatial and spectral representations of the input data. To complement local features in the training stage, we add two lightweight CNNs for feature extraction.
arXiv Detail & Related papers (2023-11-08T03:54:44Z)
Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images. We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy. Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
On quantifying and improving realism of images generated with diffusion [50.37578424163951]
We propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image. IRS is easily usable as a measure to classify a given image as real or fake. We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN. Our efforts have also led to Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models.
arXiv Detail & Related papers (2023-09-26T08:32:55Z)
RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework [15.273362355253779]
This paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework to optimize the interpretation process. The proposed RingMo-lite reduces the parameters over 60% in various RS image interpretation tasks, the average accuracy drops by less than 2% in most of the scenes and achieves SOTA performance compared to models of the similar size.
arXiv Detail & Related papers (2023-09-16T14:15:59Z)
Improving Pixel-based MIM by Reducing Wasted Modeling Capability [77.99468514275185]
We propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures. Our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
arXiv Detail & Related papers (2023-08-01T03:44:56Z)
Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images. We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations. The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction. We propose a remarkably simple and effective method, ourmethod, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query. To cope with the problem of multi-scale scarcity and target redundancy in RS multimodal retrieval task, we come up with a novel asymmetric multimodal feature matching network (AMFMN) Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval [9.423185775609426]
We develop a deep metric learning approach with generative adversarial network regularization (DML-GANR) for HSR-RSI retrieval. The experimental results on the three data sets demonstrate the superior performance of DML-GANR over state-of-the-art techniques in HSR-RSI retrieval.
arXiv Detail & Related papers (2020-10-07T02:26:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.