MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
- URL: http://arxiv.org/abs/2405.02771v1
- Date: Sat, 4 May 2024 23:16:48 GMT
- Title: MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
- Authors: Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang,
- Abstract summary: We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images.
Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE)
We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat.
- Score: 9.540487697801531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency which are crucial aspects in global scale applications.
Related papers
- OmniSat: Self-Supervised Modality Fusion for Earth Observation [5.767156832161819]
We introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels.
As demonstrated for three downstream tasks, OmniSat can learn rich representations without supervision, leading to state-of-the-art performances.
Our multimodal pretraining scheme improves performance even when only one modality is available for inference.
arXiv Detail & Related papers (2024-04-12T09:31:55Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data.
In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery [35.550999964460466]
We present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing dataset with 21.5 million temporal sequences.
To our best knowledge, SkySense is the largest Multi-Modal to date, whose modules can be flexibly combined or used individually to accommodate various tasks.
arXiv Detail & Related papers (2023-12-15T09:57:21Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - MIMIC: Masked Image Modeling with Image Correspondences [29.8154890262928]
Current methods for building effective pretraining datasets rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments.
We propose a pretraining dataset-curation approach that does not require any additional annotations.
Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale.
arXiv Detail & Related papers (2023-06-27T00:40:12Z) - R-MAE: Regions Meet Masked Autoencoders [113.73147144125385]
We explore regions as a potential visual analogue of words for self-supervised image representation learning.
Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
arXiv Detail & Related papers (2023-06-08T17:56:46Z) - CSP: Self-Supervised Contrastive Spatial Pre-Training for
Geospatial-Visual Representations [90.50864830038202]
We present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images.
We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images.
CSP significantly boosts the model performance with 10-34% relative improvement with various labeled training data sampling ratios.
arXiv Detail & Related papers (2023-05-01T23:11:18Z) - SatMAE: Pre-training Transformers for Temporal and Multi-Spectral
Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE)
To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
arXiv Detail & Related papers (2022-07-17T01:35:29Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - Semantic Labeling of Large-Area Geographic Regions Using Multi-View and
Multi-Date Satellite Images and Noisy OSM Training Labels [0.0]
We present a novel multi-view training framework and CNN architecture for semantically label buildings and roads.
Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches.
arXiv Detail & Related papers (2020-08-24T09:03:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.