Self-supervised Vision Transformers for Joint SAR-optical Representation Learning
- URL: http://arxiv.org/abs/2204.05381v1
- Date: Mon, 11 Apr 2022 19:42:53 GMT
- Title: Self-supervised Vision Transformers for Joint SAR-optical Representation Learning
- Authors: Yi Wang, Conrad M Albrecht, Xiao Xiang Zhu
- Abstract summary: Self-supervised learning (SSL) has attracted much interest in remote sensing and earth observation.
We explore the potential of vision transformers (ViTs) for joint SAR-optical representation learning.
- Based on DINO, a state-of-the-art SSL algorithm, we combine SAR and optical imagery by concatenating all channels into a unified input.
- Score: 19.316112344900638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) has attracted much interest in remote sensing
and earth observation due to its ability to learn task-agnostic representations
without human annotation. While most of the existing SSL works in remote
sensing utilize ConvNet backbones and focus on a single modality, we explore
the potential of vision transformers (ViTs) for joint SAR-optical
representation learning. Based on DINO, a state-of-the-art SSL algorithm that
distills knowledge from two augmented views of an input image, we combine SAR
and optical imagery by concatenating all channels into a unified input.
Subsequently, we randomly mask out channels of one modality as a data
augmentation strategy. During training, the model is fed optical-only,
SAR-only, and SAR-optical image pairs, learning both intra- and inter-modality
representations. Experimental results on the BigEarthNet-MM dataset
demonstrate the benefits of both the ViT backbones and the proposed multimodal
SSL algorithm, DINO-MM.
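The sketch below is a minimal illustration (not the authors' released code) of the augmentation described in the abstract: SAR and optical channels are concatenated into a single input, and with some probability all channels of one modality are masked out, so the DINO views can be optical-only, SAR-only, or joint SAR-optical. The class name RandomModalityMask, the channel counts (2 SAR channels, 12 optical bands), the masking probabilities, and zero-filling as the masking operation are assumptions made for illustration.

```python
import torch


class RandomModalityMask:  # hypothetical name, for illustration only
    """Randomly mask all channels of one modality in a stacked SAR-optical patch."""

    def __init__(self, num_sar: int = 2, num_opt: int = 12, p_mask: float = 0.5):
        self.num_sar = num_sar    # assumed: Sentinel-1 VV/VH first
        self.num_opt = num_opt    # assumed: Sentinel-2 bands after the SAR channels
        self.p_mask = p_mask      # probability of dropping one modality entirely

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # x: (C, H, W) with C = num_sar + num_opt concatenated channels
        assert x.shape[0] == self.num_sar + self.num_opt
        x = x.clone()
        if torch.rand(1).item() < self.p_mask:
            if torch.rand(1).item() < 0.5:
                x[: self.num_sar] = 0.0   # SAR masked -> optical-only view
            else:
                x[self.num_sar:] = 0.0    # optical masked -> SAR-only view
        return x                          # otherwise: joint SAR-optical view


# Usage: apply per augmented view before feeding the ViT-based student/teacher.
sar = torch.randn(2, 120, 120)    # dummy SAR patch
opt = torch.randn(12, 120, 120)   # dummy optical patch
view = RandomModalityMask()(torch.cat([sar, opt], dim=0))
```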
Related papers
- SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - Dcl-Net: Dual Contrastive Learning Network for Semi-Supervised
Multi-Organ Segmentation [12.798684146496754]
We propose a two-stage Dual Contrastive Learning Network for semi-supervised MoS.
In Stage 1, we develop a similarity-guided global contrastive learning to explore the implicit continuity and similarity among images.
In Stage 2, we present an organ-aware local contrastive learning to further attract the class representations.
arXiv Detail & Related papers (2024-03-06T07:39:33Z) - Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation [11.637738540262797]
This study introduces Semi-Mamba-UNet, which integrates a purely visual Mamba-based encoder-decoder architecture with a conventional CNN-based UNet into a semi-supervised learning framework.
This innovative SSL approach leverages both networks to generate pseudo-labels and cross-supervise one another at the pixel level simultaneously.
We introduce a self-supervised pixel-level contrastive learning strategy that employs a pair of projectors to enhance the feature learning capabilities further.
arXiv Detail & Related papers (2024-02-11T17:09:21Z) - Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z) - ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning of increasing modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z) - CMID: A Unified Self-Supervised Learning Framework for Remote Sensing
Image Understanding [20.2438336674081]
Contrastive Mask Image Distillation (CMID) is capable of learning representations with both global semantic separability and local spatial perceptibility.
CMID is compatible with both convolutional neural networks (CNN) and vision transformers (ViT).
Models pre-trained using CMID achieve better performance than other state-of-the-art SSL methods on multiple downstream tasks.
arXiv Detail & Related papers (2023-04-19T13:58:31Z) - Leveraging the Third Dimension in Contrastive Learning [88.17394309208925]
Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks.
The data augmentations used by these methods ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment.
We explore two distinct approaches to incorporating depth signals into the SSL framework.
arXiv Detail & Related papers (2023-01-27T15:45:03Z) - Self-Supervised Learning for Invariant Representations from
Multi-Spectral and SAR Images [5.994412766684843]
Self-supervised learning (SSL) has become the new state of the art in several classification and segmentation tasks.
This work proposes RSDnet, which applies the distillation network (BYOL) in the remote sensing (RS) domain.
arXiv Detail & Related papers (2022-05-04T13:16:48Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance by significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Contrastive Multiview Coding with Electro-optics for SAR Semantic
Segmentation [0.6445605125467573]
We propose multi-modal representation learning for SAR semantic segmentation.
Unlike previous studies, our method jointly uses EO imagery, SAR imagery, and a label mask.
Several experiments show that our approach is superior to the existing methods in model performance, sample efficiency, and convergence speed.
arXiv Detail & Related papers (2021-08-31T23:55:41Z) - ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos [49.337912335944026]
We formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model with two input modalities: RGB and Depth videos.
To benefit from the capabilities of vision Transformers, we use two vision Transformer models, one for human detection and one for visual feature representation.
A temporal representation of the human body is obtained using a vision Transformer and an LSTM network.
arXiv Detail & Related papers (2021-08-23T10:48:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.