CMID: A Unified Self-Supervised Learning Framework for Remote Sensing
Image Understanding
- URL: http://arxiv.org/abs/2304.09670v2
- Date: Fri, 4 Aug 2023 02:42:50 GMT
- Title: CMID: A Unified Self-Supervised Learning Framework for Remote Sensing
Image Understanding
- Authors: Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Zhenshi Li, Feng Gu
- Abstract summary: Contrastive Mask Image Distillation (CMID) is capable of learning representations with both global semantic separability and local spatial perceptibility.
CMID is compatible with both convolutional neural networks (CNN) and vision transformers (ViT).
Models pre-trained using CMID achieve better performance than other state-of-the-art SSL methods on multiple downstream tasks.
- Score: 20.2438336674081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has gained widespread attention in the remote
sensing (RS) and earth observation (EO) communities owing to its ability to
learn task-agnostic representations without human-annotated labels.
Nevertheless, most existing RS SSL methods are limited to learning either
globally semantically separable or locally spatially perceptible representations. We
argue that this learning strategy is suboptimal in the realm of RS, since the
required representations for different RS downstream tasks are often varied and
complex. In this study, we proposed a unified SSL framework that is better
suited for RS images representation learning. The proposed SSL framework,
Contrastive Mask Image Distillation (CMID), is capable of learning
representations with both global semantic separability and local spatial
perceptibility by combining contrastive learning (CL) with masked image
modeling (MIM) in a self-distillation manner. Furthermore, our CMID learning
framework is architecture-agnostic, which is compatible with both convolutional
neural networks (CNN) and vision transformers (ViT), allowing CMID to be easily
adapted to a variety of deep learning (DL) applications for RS understanding.
Comprehensive experiments have been carried out on four downstream tasks (i.e.,
scene classification, semantic segmentation, object detection, and change
detection) and the results show that models pre-trained using CMID achieve
better performance than other state-of-the-art SSL methods on multiple
downstream tasks. The code and pre-trained models will be made available at
https://github.com/NJU-LHRS/official-CMID to facilitate SSL research and speed
up the development of RS image DL applications.
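The core recipe described above, a contrastive (InfoNCE-style) objective combined with a masked-image-modeling reconstruction objective and a self-distilled teacher kept as an exponential moving average of the student, can be sketched compactly. The following is a minimal NumPy illustration of that general idea, not the authors' implementation; all shapes, function names, the loss combination, and the masking ratio are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(student, teacher, temperature=0.2):
    """Contrastive (InfoNCE) loss between L2-normalized student/teacher embeddings."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positive pairs on the diagonal

def mim_loss(pred_patches, target_patches, mask):
    """Masked-image-modeling loss: reconstruct only the masked patches."""
    diff = (pred_patches - target_patches) ** 2
    return diff[mask].mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    """Self-distillation: teacher weights are an EMA of the student's."""
    return momentum * teacher_w + (1 - momentum) * student_w

# Toy batch: 8 images, 16 patches each, 32-dim features (placeholder data).
student_emb = rng.normal(size=(8, 32))
teacher_emb = student_emb + 0.01 * rng.normal(size=(8, 32))  # second view
pred = rng.normal(size=(8, 16, 32))
target = rng.normal(size=(8, 16, 32))
mask = rng.random((8, 16)) < 0.6                   # ~60% of patches masked

loss = info_nce(student_emb, teacher_emb) + mim_loss(pred, target, mask)
print(np.isfinite(loss))
```

In a real pipeline the embeddings and patch predictions would come from a CNN or ViT backbone, and `ema_update` would be applied parameter-wise after each optimizer step; the equal weighting of the two losses here is an assumption.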
Related papers
- SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to incorporate the inherent knowledge of LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z) - A Dual-branch Self-supervised Representation Learning Framework for
Tumour Segmentation in Whole Slide Images [12.961686610789416]
Self-supervised learning (SSL) has emerged as an alternative solution to reduce the annotation overheads in whole slide images.
These SSL approaches are not designed for handling multi-resolution WSIs, which limits their performance in learning discriminative image features.
We propose a Dual-branch SSL Framework for WSI tumour segmentation (DSF-WSI) that can effectively learn image features from multi-resolution WSIs.
arXiv Detail & Related papers (2023-03-20T10:57:28Z) - Learning Common Rationale to Improve Self-Supervised Representation for
Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z) - Object discovery and representation networks [78.16003886427885]
We propose a self-supervised learning paradigm that discovers the structure encoded in priors by itself.
Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision.
arXiv Detail & Related papers (2022-03-16T17:42:55Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Remote Sensing Images Semantic Segmentation with General Remote Sensing
Vision Model via a Self-Supervised Contrastive Learning Method [13.479068312825781]
We propose Global style and Local matching Contrastive Learning Network (GLCNet) for remote sensing semantic segmentation.
Specifically, the global style contrastive module is used to better learn an image-level representation.
The local features matching contrastive module is designed to learn representations of local regions, which is beneficial for semantic segmentation.
arXiv Detail & Related papers (2021-06-20T03:03:40Z) - Self-Supervised Learning with Kernel Dependence Maximization [23.618292038419654]
We propose Self-Supervised Learning with the Hilbert-Schmidt Independence Criterion (SSL-HSIC).
SSL-HSIC maximizes dependence between representations of transformed versions of an image and the image identity.
This self-supervised learning framework yields a new understanding of InfoNCE, a variational lower bound on the mutual information (MI) between different transformations.
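The Hilbert-Schmidt Independence Criterion that SSL-HSIC builds on has a simple biased empirical estimator, HSIC(K, L) = tr(KHLH)/(n-1)^2, where K and L are kernel matrices over the two variables and H = I - (1/n)11^T is the centering matrix. A minimal NumPy sketch of that estimator follows; the RBF kernel, bandwidth, and toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(K, L):
    """Biased empirical HSIC estimator: tr(K H L H) / (n-1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
Y_dep = X + 0.1 * rng.normal(size=(64, 5))         # strongly dependent on X
Y_ind = rng.normal(size=(64, 5))                   # independent of X

K = rbf_kernel(X)
print(hsic(K, rbf_kernel(Y_dep)) > hsic(K, rbf_kernel(Y_ind)))  # → True
```

Dependent pairs yield a markedly larger HSIC value than independent ones, which is the quantity an SSL-HSIC-style objective would maximize between two augmented views of the same image.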
arXiv Detail & Related papers (2021-06-15T17:51:16Z) - Multi-Perspective LSTM for Joint Visual Representation Learning [81.21490913108835]
We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
arXiv Detail & Related papers (2021-05-06T16:44:40Z) - Remote Sensing Image Scene Classification with Self-Supervised Paradigm
under Limited Labeled Samples [11.025191332244919]
We introduce a new self-supervised learning (SSL) mechanism to obtain a high-performance pre-training model for RSI scene classification from large amounts of unlabeled data.
Experiments on three commonly used RSI scene classification datasets demonstrate that this new learning paradigm outperforms the traditional dominant ImageNet pre-trained model.
The insights distilled from our studies can help to foster the development of SSL in the remote sensing community.
arXiv Detail & Related papers (2020-10-02T09:27:19Z) - More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing
Imagery Classification [43.35966675372692]
We show different fusion strategies as well as how to train deep networks and build the network architecture.
Our framework is not limited to pixel-wise classification tasks but is also applicable to spatial information modeling with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2020-08-12T17:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.