Related papers: Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

URL: http://arxiv.org/abs/2506.06852v2
Date: Wed, 16 Jul 2025 06:30:34 GMT
Title: Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation
Authors: John Waithaka, Moise Busogi
Abstract summary: We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self-supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation, a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same-group attention masking to encourage cross-modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction-based approaches.
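
The abstract names same-group attention masking but does not spell out how the mask is built. The following is a minimal sketch of one plausible reading, not the authors' implementation: each token carries a channel-group id, and a token is blocked from attending to other tokens in its own group, so information has to flow across groups (modalities). The function name `same_group_attention_mask`, the `allow_self` flag, and the example group layout are all hypothetical.

```python
def same_group_attention_mask(group_ids, allow_self=True):
    """Build a boolean attention mask from per-token channel-group ids.

    mask[i][j] is True when token i may attend to token j.
    Assumption (not from the paper): attention between two distinct
    tokens of the same group is blocked; attention to oneself is
    controlled by `allow_self`.
    """
    n = len(group_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(allow_self)
            else:
                # Allow attention only across different channel groups.
                row.append(group_ids[i] != group_ids[j])
        mask.append(row)
    return mask


# Hypothetical layout: two tokens from a SAR group (0) and two from an optical group (1).
group_ids = [0, 0, 1, 1]
mask = same_group_attention_mask(group_ids)
for row in mask:
    print(row)
```

In a transformer, a mask of this shape would typically be applied to the attention logits before the softmax; whether the paper also masks a token's attention to itself is not stated in the abstract.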

Related papers

Evaluating the Efficacy of Cut-and-Paste Data Augmentation in Semantic Segmentation for Satellite Imagery [4.499833362998487]
This study explores the effectiveness of a Cut-and-Paste augmentation technique for semantic segmentation in satellite images. We adapt this augmentation, which usually requires labeled instances, to the case of semantic segmentation. Using the DynamicEarthNet dataset and a U-Net model for evaluation, we found that this augmentation significantly enhances the mIoU score on the test set from 37.9 to 44.1.
arXiv Detail & Related papers (2024-04-08T17:18:30Z)
SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts. We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query. Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
Towards On-Board Panoptic Segmentation of Multispectral Satellite Images [41.34294145237618]
We propose a lightweight pipeline for on-board panoptic segmentation of multi-spectral satellite images. Panoptic segmentation offers major economic and environmental insights, ranging from yield estimation from agricultural lands to intelligence for complex military applications. Our evaluations demonstrate a substantial increase in accuracy metrics compared to the existing state-of-the-art models.
arXiv Detail & Related papers (2022-04-05T03:10:39Z)
Embedding Earth: Self-supervised contrastive pre-training for dense land cover classification [61.44538721707377]
We present Embedding Earth, a self-supervised contrastive pre-training method for leveraging the large availability of satellite imagery. We observe significant improvements of up to 25% absolute mIoU when pre-trained with our proposed method. We find that learnt features can generalize between disparate regions, opening up the possibility of using the proposed pre-training scheme.
arXiv Detail & Related papers (2022-03-11T16:14:14Z)
Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation [88.49669148290306]
We propose a novel weakly supervised multi-task framework called AuxSegNet to leverage saliency detection and multi-label image classification as auxiliary tasks. Inspired by their similar structured semantics, we also propose to learn a cross-task global pixel-level affinity map from the saliency and segmentation representations. The learned cross-task affinity can be used to refine saliency predictions and propagate CAM maps to provide improved pseudo labels for both tasks.
arXiv Detail & Related papers (2021-07-25T11:39:58Z)
Context-self contrastive pretraining for crop type semantic segmentation [39.81074867563505]
The proposed Context-Self Contrastive Loss (CSCL) learns an embedding space that makes semantic boundaries pop up. For crop type semantic segmentation from Satellite Image Time Series (SITS), we find performance at parcel boundaries to be a critical bottleneck. We present a process for semantic segmentation at super-resolution for obtaining crop classes at a more granular level.
arXiv Detail & Related papers (2021-04-09T11:29:44Z)
Spatially Consistent Representation Learning [12.120041613482558]
We propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks. We devise a novel self-supervised objective that tries to produce coherent spatial representations of a randomly cropped local region. On various downstream localization tasks with benchmark datasets, the proposed SCRL shows significant performance improvements.
arXiv Detail & Related papers (2021-03-10T15:23:45Z)
Self-supervised Segmentation via Background Inpainting [96.10971980098196]
We introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera. We exploit a self-supervised loss function to train a proposal-based segmentation network. We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
arXiv Detail & Related papers (2020-11-11T08:34:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.