Related papers: MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

URL: http://arxiv.org/abs/2405.02771v1
Date: Sat, 4 May 2024 23:16:48 GMT
Title: MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
Authors: Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang,
Abstract summary: We propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE) We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat.
Score: 9.540487697801531
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency which are crucial aspects in global scale applications.

Related papers

A Recipe for Improving Remote Sensing VLM Zero Shot Generalization [0.4427533728730559]
We present two novel image-caption datasets for training of remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z)
On Domain-Specific Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs through post-training. We focus on data synthesis, training pipelines, and task evaluation. We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z)
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework,LLM-Morph, for aligning the deep features from different modal medical images. Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights. Third, for the alignment of tokens, we utilize other four adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
Instruction-Guided Visual Masking [25.26544571379426]
Instruction-guided Visual Masking (IVM) is a versatile visual grounding model that is compatible with diverse multimodal models. IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions.
arXiv Detail & Related papers (2024-05-30T07:48:32Z)
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data. In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information. In this study, we find that the intermediate layers of models can encode more global semantic information. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
MIMIC: Masked Image Modeling with Image Correspondences [29.8154890262928]
Current methods for building effective pretraining datasets rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a pretraining dataset-curation approach that does not require any additional annotations. Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale.
arXiv Detail & Related papers (2023-06-27T00:40:12Z)
R-MAE: Regions Meet Masked Autoencoders [113.73147144125385]
We explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations [90.50864830038202]
We present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images. We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images. CSP significantly boosts the model performance with 10-34% relative improvement with various labeled training data sampling ratios.
arXiv Detail & Related papers (2023-05-01T23:11:18Z)
Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE) M3AE learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images and Noisy OSM Training Labels [0.0]
We present a novel multi-view training framework and CNN architecture for semantically label buildings and roads. Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches.
arXiv Detail & Related papers (2020-08-24T09:03:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.